Numerical algorithms, optimisation, and computer performance: a selection of research from 2020

I won’t bore you with clichés about how weird and difficult this year has been. Instead, I’d like to point out straightaway how large this year’s selection of papers is: more than 400, over twice as many as I selected last year.

This year, I haven’t changed my Google Scholar query:

algorithm sparse method parallel performance matrix numerical optimization graph

but I have added some more supplemental papers, as I saw fit.

In the coming year, I expect to add some papers about Computational Sustainability, a field I’m getting very interested in.

An analysis of the data

Load the bibliography data and prepare the documents for LDA topic modelling.
rng('default');
textData = string(fileread("bibliography.txt"));
textData = splitlines(textData);
textData = iCleanBibliographyText(textData);
documents = iPrepareDocument(textData);
[bagTraining, docsTest] = iSplitTrainingTest(documents, 0.8);
Train a number of models, changing the number of topics.
allNumTopics = 2:20;
sizeAllNumTopics = numel(allNumTopics);
mdl = cell(sizeAllNumTopics, 1);
logProbabilities = cell(sizeAllNumTopics, 1);
meanLogProbabilities = zeros(sizeAllNumTopics, 1);
perplexities = cell(sizeAllNumTopics, 1);
meanPerplexities = zeros(sizeAllNumTopics, 1);
for k = 1:sizeAllNumTopics
    rng('default');
    numTopics = allNumTopics(k);
    mdl{k} = fitlda(bagTraining, numTopics, 'Verbose', 0);
    logProbabilities{k} = logp(mdl{k}, bagTraining, 'NumSamples', 200);
    [~, perplexities{k}] = logp(mdl{k}, docsTest, 'NumSamples', 200);
    meanLogProbabilities(k) = mean(logProbabilities{k});
    meanPerplexities(k) = mean(perplexities{k});
end
Compare the models by plotting log-probabilities over the training set and perplexities over the test set.
figure;
subplot(1, 2, 1);
plot(allNumTopics, meanLogProbabilities);
ylabel("Log-probability on the training set");
xlabel("Number of topics");
subplot(1, 2, 2);
plot(allNumTopics, meanPerplexities);
ylabel("Perplexity on the test set");
xlabel("Number of topics");
sgtitle("Model comparison");
Pick the model with the best performance on the test set.
[~, idxBestModel] = min(meanPerplexities);
bestModel = mdl{idxBestModel};
bestNumTopics = allNumTopics(idxBestModel);
iDisplayTopics(bestModel);
Topic 1: problem solution solve show order case find apply introduce approximate
Topic 2: propose base many number different computational cost set good size
Topic 3: numerical method first experiment accuracy fast novel two accelerate compare
Topic 4: datum performance provide new require execution challenge various process feature
Topic 5: model optimization learn structure network machine parameter stateoftheart present largescale
Topic 6: implementation compute performance achieve gpu gpus test technique architecture design
Topic 7: method problem gradient iteration complexity analysis condition optimization exist particular
Topic 8: matrix memory parallel block communication distribute computation partition sparse datum
Topic 9: application program code analysis performance include study hpc framework exist
Topic 10: system solver large linear equation low nonlinear result example lead
Topic 11: algorithm result work improve propose term vector constraint version mean
Topic 12: approach time reduce work speedup well three implement experimental balance
Topic 13: algorithm sparse matrix linear parallel present factorization demonstrate iterative direct
Topic 14: library system paper scientific software hardware simulation highperformance spmv application
Topic 15: tensor result tool paper show two decomposition factor guarantee input
Topic 16: convergence function point convex rate global stochastic local bound class
Topic 17: error compute precision runtime operation scale algebra computation overhead arithmetic
Topic 18: graph cluster algorithm spectral random nod sequence edge quadratic match
iDisplayLogProbability(logProbabilities{idxBestModel});
Show how the best model categorises the test documents
topicMixture = transform(bestModel, docsTest);
iDisplayTopicMixtures(topicMixture, bestNumTopics);
Compare the topics found by the model
allTopicMixtures = transform(bestModel, documents);
iDisplayCloseness(allTopicMixtures);
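As a sanity check, the fitted model can categorise a single unseen abstract through the same preprocessing pipeline (the abstract text here is made up for illustration).
newText = "We propose a parallel sparse matrix factorization with GPU acceleration and analyze its convergence.";
newDoc = iPrepareDocument(newText);
newMixture = transform(bestModel, newDoc);
[topicProb, topicIdx] = max(newMixture);
disp("Most likely topic: " + topicIdx + " (probability " + topicProb + ")");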

Helper functions

function cleanDocuments = iCleanBibliographyText(document)
% Clean the input array of strings, making sure each string
% corresponds to a bibliography entry.
stringSizes = strlength(document);
document(stringSizes==0) = [];
isAbstract = contains(document, "Abstract:");
idxAbstract = find(isAbstract);
% Each abstract ends two rows before the next one begins;
% the last abstract runs to the end of the file.
idxEndAbstract = [idxAbstract-2; numel(document)];
idxEndAbstract(1) = [];
cleanDocuments = repmat("", [numel(idxAbstract), 1]);
% Join the lines of each abstract into a single string.
for k = 1:numel(idxAbstract)
    currAbstractStart = idxAbstract(k);
    currAbstractEnd = idxEndAbstract(k);
    cleanDocuments(k) = join(document(currAbstractStart:currAbstractEnd));
end
cleanDocuments(strlength(cleanDocuments)==0 | ismissing(cleanDocuments)) = [];
end
function documents = iPrepareDocument(textData)
% Prepare the document for analysis
documents = tokenizedDocument(textData);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents, 2);
documents = removeWords(documents, ["Abstract", "2020", "arXiv"]);
documents = normalizeWords(documents, 'Style', 'lemma');
end
function iDisplayTopics(mdl)
% Display the ten most probable words of each topic in the model
numTopics = mdl.NumTopics;
for idxTopic = 1:numTopics
    topicWords = join(iFindTopicWords(idxTopic), " ");
    disp("Topic " + idxTopic + ": " + topicWords);
end
function words = iFindTopicWords(idxTopic)
% Return the ten highest-probability words of the given topic.
wordProbabilities = mdl.TopicWordProbabilities(:, idxTopic);
wordList = mdl.Vocabulary;
[~, idxTopProbabilities] = maxk(wordProbabilities, 10);
words = wordList(idxTopProbabilities);
end
end
function iDisplayTopicMixtures(topicMixtures, numTopics)
% Display the probability of each topic for each document
figure
area(topicMixtures);
ylim([0 1])
xlim([0, size(topicMixtures, 1)]);
title("Topic mixtures over the test set")
ylabel("Topic probability")
xlabel("Document number")
legend("Topic " + string(1:numTopics),'Location','northeastoutside')
end
function iDisplayLogProbability(logProbabilities)
% Display a log-probability diagram
figure
histogram(logProbabilities)
xlabel("Log probability")
ylabel("Frequency")
title("Document log-probabilities")
end
function iDisplayCloseness(allTopicMixtures)
% Display the closeness between topics, measured as the cosine
% similarity between the topic-mixture columns.
topicNorms = vecnorm(allTopicMixtures, 2);
numTopics = numel(topicNorms);
topicCloseness = (allTopicMixtures')*(allTopicMixtures)./((topicNorms')*topicNorms);
figure
% imagesc shows all numTopics-by-numTopics cells, whereas pcolor
% would drop the last row and column.
imagesc(topicCloseness);
colorbar('location', 'EastOutside');
xlabel("Topic number");
ylabel("Topic number");
title("Closeness between topics");
plotAxis = gca;
plotAxis.XTick = 1:numTopics;
plotAxis.YTick = 1:numTopics;
pbaspect(plotAxis, [1, 1, 1]);
end
function [bagTraining, docsTest] = iSplitTrainingTest(documents, ratio)
% Split training set and test set
numDocuments = numel(documents);
numTraining = round(ratio*numDocuments);
documents = documents(randperm(numDocuments));
bagTraining = bagOfWords(documents(1:numTraining));
docsTest = documents(numTraining+1:end);
end

From the data analysis, I see:

Abdelfattah A, Tomov S and Dongarra J (2020), "Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices Using GPUs", In Lecture Notes in Computer Science. , pp. 237-250. Springer International Publishing.
Abstract: Half-precision computation refers to performing floating-point operations in a 16-bit format. While half-precision has been driven largely by machine learning applications, recent algorithmic advances in numerical linear algebra have discovered beneficial use cases for half precision in accelerating the solution of linear systems of equations at higher precisions. In this paper, we present a high-performance, mixed-precision linear solver (Ax=b) for symmetric positive definite systems in double-precision using graphics processing units (GPUs). The solver is based on a mixed-precision Cholesky factorization that utilizes the high-performance tensor core units in CUDA-enabled GPUs. Since the Cholesky factors are affected by the low precision, an iterative refinement (IR) solver is required to recover the solution back to double-precision accuracy. Two different types of IR solvers are discussed on a wide range of test matrices. A preprocessing step is also developed, which scales and shifts the matrix, if necessary, in order to preserve its positive-definiteness in lower precisions. Our experiments on the V100 GPU show that performance speedups are up to 4.7× against a direct double-precision solver. However, matrix properties such as the condition number and the eigenvalue distribution can affect the convergence rate, which would consequently affect the overall performance.
BibTeX:
@incollection{Abdelfattah2020,
  author = {Ahmad Abdelfattah and Stan Tomov and Jack Dongarra},
  title = {Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices Using GPUs},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {237--250},
  doi = {10.1007/978-3-030-50417-5_18}
}
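
The gist of the approach above can be sketched in a few lines of MATLAB. This is only a toy: single precision stands in for FP16 (base MATLAB has no half type), and there is no GPU, scaling, or shifting step.
A = gallery('lehmer', 100);           % symmetric positive definite test matrix
b = randn(100, 1);
R = chol(single(A));                  % Cholesky factorization in low precision
x = double(R \ (R' \ single(b)));     % low-precision solve as starting guess
for iter = 1:10                       % iterative refinement in double precision
    r = b - A*x;                      % residual computed in high precision
    d = double(R \ (R' \ single(r))); % correction from the low-precision factors
    x = x + d;
    if norm(r) <= 1e-14*norm(b), break; end
end
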
Abdelfattah A, Tomov S and Dongarra J (2020), "Matrix multiplication on batches of small matrices in half and half-complex precisions", Journal of Parallel and Distributed Computing., 11, 2020. Vol. 145, pp. 188-201. Elsevier BV.
Abstract: Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations -- in particular general matrix-matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also very important for dense linear algebra algorithms, and half-precision GEMM operations can be used in mixed-precision linear solvers. Therefore, high-performance batched GEMM operations in reduced precision are significantly important, not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers. This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic. The paper addresses both real and complex half-precision computations on the GPU. The proposed design takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, performance speedups are observed against cuBLAS for sizes up to 128, and range between 1.5× and 2.5×. For the complex FP16 GEMM kernel, the speedups are between 1.7× and 7× thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor's solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
BibTeX:
@article{Abdelfattah2020b,
  author = {Ahmad Abdelfattah and Stanimire Tomov and Jack Dongarra},
  title = {Matrix multiplication on batches of small matrices in half and half-complex precisions},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {145},
  pages = {188--201},
  doi = {10.1016/j.jpdc.2020.07.001}
}
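
MATLAB's closest analogue to a batched GEMM is pagemtimes, available from R2020b; single precision again stands in for FP16 in this small sketch.
A = rand(16, 16, 1000, 'single');   % a batch of 1000 small matrices
B = rand(16, 16, 1000, 'single');
C = pagemtimes(A, B);               % C(:,:,k) = A(:,:,k)*B(:,:,k) for every k
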
Acebrón JA (2020), "A Probabilistic Linear Solver Based on a Multilevel Monte Carlo Method", Journal of Scientific Computing., 2, 2020. Vol. 82(3) Springer Science and Business Media LLC.
Abstract: We describe a new Monte Carlo method based on a multilevel method for computing the action of the resolvent matrix over a vector. The method is based on the numerical evaluation of the Laplace transform of the matrix exponential, which is computed efficiently using a multilevel Monte Carlo method. Essentially, it requires generating suitable random paths which evolve through the indices of the matrix according to the probability law of a continuous-time Markov chain governed by the associated Laplacian matrix. The convergence of the proposed multilevel method has been discussed, and several numerical examples were run to test the performance of the algorithm. These examples concern the computation of some metrics of interest in the analysis of complex networks, and the numerical solution of a boundary-value problem for an elliptic partial differential equation. In addition, the algorithm was conveniently parallelized, and the scalability analyzed and compared with the results of other existing Monte Carlo methods for solving linear algebra systems.
BibTeX:
@article{Acebron2020,
  author = {Juan A. Acebrón},
  title = {A Probabilistic Linear Solver Based on a Multilevel Monte Carlo Method},
  journal = {Journal of Scientific Computing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {82},
  number = {3},
  doi = {10.1007/s10915-020-01168-2}
}
Acer S and Aykanat C (2020), "Reordering sparse matrices into block-diagonal column-overlapped form", Journal of Parallel and Distributed Computing., 3, 2020. Elsevier BV.
Abstract: Many scientific and engineering applications necessitate computing the minimum norm solution of a sparse underdetermined linear system of equations. The minimum 2-norm solution of such systems can be obtained by a recent parallel algorithm, whose numerical effectiveness and parallel scalability are validated in both shared- and distributed-memory architectures. This parallel algorithm assumes the coefficient matrix in a block-diagonal column-overlapped (BDCO) form, which is a variant of the block-diagonal form where the successive diagonal blocks may overlap along their columns. The total overlap size of the BDCO form is an important metric in the performance of the subject parallel algorithm since it determines the size of the reduced system, solution of which is a bottleneck operation in the parallel algorithm. In this work, we propose a hypergraph partitioning model for reordering sparse matrices into BDCO form with the objective of minimizing the total overlap size and the constraint of maintaining balance on the number of nonzeros of the diagonal blocks. Our model makes use of existing partitioning tools that support fixed vertices in the recursive bipartitioning paradigm. Experimental results validate the use of our model as it achieves small overlap size and balanced diagonal blocks.
BibTeX:
@article{Acer2020,
  author = {Seher Acer and Cevdet Aykanat},
  title = {Reordering sparse matrices into block-diagonal column-overlapped form},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.jpdc.2020.03.002}
}
Aggarwal CC (2020), "The Linear Algebra of Similarity", In Linear Algebra and Optimization for Machine Learning. , pp. 379-410. Springer International Publishing.
Abstract: A dot-product similarity matrix is an alternative way to represent a multidimensional data set. In other words, one can convert an n × d data matrix D into an n × n similarity matrix S = DD^T (which contains n^2 pairwise dot products between points). One can use S instead of D for machine learning algorithms. The reason is that the similarity matrix contains almost the same information about the data as the original matrix. This equivalence is the genesis of a large class of methods in machine learning, referred to as kernel methods. This chapter builds the linear algebra framework required for understanding this important class of methods in machine learning. The real utility of such methods arises when the similarity matrix is chosen differently from the use of dot products (and the data matrix is sometimes not even available).
BibTeX:
@incollection{Aggarwal2020,
  author = {Charu C. Aggarwal},
  title = {The Linear Algebra of Similarity},
  booktitle = {Linear Algebra and Optimization for Machine Learning},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {379--410},
  doi = {10.1007/978-3-030-40344-7_9}
}
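
The chapter's starting point is easy to reproduce: the n × n similarity matrix carries essentially the same information as the data matrix, and kernel methods simply swap the dot product for another similarity. A minimal illustration, with a Gaussian kernel standing in for a generic non-dot-product similarity:
D = randn(100, 5);                          % n-by-d data matrix
S = D*D';                                   % n-by-n dot-product similarity matrix
% Kernel methods replace the dot product with another similarity, for
% example a Gaussian kernel, without explicit feature-space coordinates.
sqDist = sum(D.^2, 2) + sum(D.^2, 2)' - 2*S;
K = exp(-sqDist / (2*2^2));                 % Gaussian kernel with sigma = 2
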
Agullo E, Cools S, Fatih-Yetkin E, Giraud L, Schenkel N and Vanroose W (2020), "On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection". Technical report at: INRIA.
Abstract: The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than sixty years old, it is still a serious candidate for extreme-scale computation on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices affect dramatically their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset, which consists in a bit-flip in a memory cell producing unexpected results at application level. Consequently, the future computing facilities at extreme scale might be more prone to errors of any kind, including bit-flips during calculation. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such soft errors; we assess their robustness through extensive numerical experiments.
BibTeX:
@techreport{Agullo2020,
  author = {Emmanuel Agullo and Siegfried Cools and Emrullah Fatih-Yetkin and Luc Giraud and Nick Schenkel and Wim Vanroose},
  title = {On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection},
  school = {INRIA},
  year = {2020},
  url = {https://hal.inria.fr/hal-02495301}
}
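
A single event upset of the kind studied here is easy to emulate in MATLAB by flipping one bit of a computed value via typecast; a small sketch of fault injection into the matrix-vector product, the kernel the report studies:
A = gallery('poisson', 10);                % a small SPD test matrix
x = randn(size(A, 1), 1);
y = A*x;                                   % the matrix-vector product under test
k = randi(numel(y)); bitpos = randi(64);   % random entry and random bit position
bits = typecast(y(k), 'uint64');
y(k) = typecast(bitxor(bits, bitshift(uint64(1), bitpos-1)), 'double');
% Feeding the corrupted y back into a CG iteration shows whether the upset
% merely delays convergence or silently corrupts the computed solution.
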
Agullo E, Altenbernd M, Anzt H, Bautista-Gomez L, Benacchio T, Bonaventura L, Bungartz H-J, Chatterjee S, Ciorba FM, DeBardeleben N, Drzisga D, Eibl S, Engelmann C, Gansterer WN, Giraud L, Goeddeke D, Heisig M, Jezequel F, Kohl N, Li XS, Lion R, Mehl M, Mycek P, Obersteiner M, Quintana-Orti ES, Rizzi F, Ruede U, Schulz M, Fung F, Speck R, Stals L, Teranishi K, Thibault S, Thoennes D, Wagner A and Wohlmuth B (2020), "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations", October, 2020.
Abstract: This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.
BibTeX:
@article{Agullo2020a,
  author = {Emmanuel Agullo and Mirco Altenbernd and Hartwig Anzt and Leonardo Bautista-Gomez and Tommaso Benacchio and Luca Bonaventura and Hans-Joachim Bungartz and Sanjay Chatterjee and Florina M. Ciorba and Nathan DeBardeleben and Daniel Drzisga and Sebastian Eibl and Christian Engelmann and Wilfried N. Gansterer and Luc Giraud and Dominik Goeddeke and Marco Heisig and Fabienne Jezequel and Nils Kohl and Xiaoye Sherry Li and Romain Lion and Miriam Mehl and Paul Mycek and Michael Obersteiner and Enrique S. Quintana-Orti and Francesco Rizzi and Ulrich Ruede and Martin Schulz and Fred Fung and Robert Speck and Linda Stals and Keita Teranishi and Samuel Thibault and Dominik Thoennes and Andreas Wagner and Barbara Wohlmuth},
  title = {Resiliency in Numerical Algorithm Design for Extreme Scale Simulations},
  year = {2020}
}
Ahmad N, Yilmaz B and Unat D (2020), "A Prediction Framework for Fast Sparse Triangular Solves", In Euro-Par 2020: Parallel Processing. , pp. 529-545. Springer International Publishing.
Abstract: Sparse triangular solve (SpTRSV) is an important linear algebra kernel, finding extensive uses in numerical and scientific computing. The parallel implementation of SpTRSV is a challenging task due to the sequential nature of the steps involved. This makes it, in many cases, one of the most time-consuming operations in an application. Many approaches for efficient SpTRSV on CPU and GPU systems have been proposed in the literature. However, no single implementation or platform (CPU or GPU) gives the fastest solution for all input sparse matrices. In this work, we propose a machine learning-based framework to predict the SpTRSV implementation giving the fastest execution time for a given sparse matrix based on its structural features. The framework is tested with six SpTRSV implementations on a state-of-the-art CPU-GPU machine (Intel Xeon Gold CPU, NVIDIA V100 GPU). Experimental results, with 998 matrices taken from the SuiteSparse Matrix Collection, show the classifier prediction accuracy of 87% for the fastest SpTRSV algorithm for a given input matrix. Predicted SpTRSV implementations achieve average speedups (harmonic mean) in the range of 1.4--2.7× against the six SpTRSV implementations used in the evaluation.
BibTeX:
@incollection{Ahmad2020,
  author = {Najeeb Ahmad and Buse Yilmaz and Didem Unat},
  title = {A Prediction Framework for Fast Sparse Triangular Solves},
  booktitle = {Euro-Par 2020: Parallel Processing},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {529--545},
  doi = {10.1007/978-3-030-57675-2_33}
}
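
The mechanics of such a predictor are simple to sketch. Everything below is illustrative: the three structural features and the labels are stand-ins (the paper uses its own feature set and measured winners), and fitctree requires the Statistics and Machine Learning Toolbox.
numMatrices = 100;
features = zeros(numMatrices, 3);
fastest = randi(6, numMatrices, 1);        % stand-in for the measured fastest kernel
for k = 1:numMatrices
    A = sprand(500, 500, 0.01) + speye(500);
    rowCounts = full(sum(A ~= 0, 2));      % nonzeros per row
    features(k, :) = [nnz(A), mean(rowCounts), std(rowCounts)];
end
classifier = fitctree(features, fastest);  % predict the winner from structure
predicted = predict(classifier, features(1, :));
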
Ahmadi AA and Zhang J (2020), "Complexity aspects of local minima and related notions", August, 2020.
Abstract: We consider the notions of (i) critical points, (ii) second-order points, (iii) local minima, and (iv) strict local minima for multivariate polynomials. For each type of point, and as a function of the degree of the polynomial, we study the complexity of deciding (1) if a given point is of that type, and (2) if a polynomial has a point of that type. Our results characterize the complexity of these two questions for all degrees left open by prior literature. Our main contributions reveal that many of these questions turn out to be tractable for cubic polynomials. In particular, we present an efficiently-checkable necessary and sufficient condition for local minimality of a point for a cubic polynomial. We also show that a local minimum of a cubic polynomial can be efficiently found by solving semidefinite programs of size linear in the number of variables. By contrast, we show that it is strongly NP-hard to decide if a cubic polynomial has a critical point. We also prove that the set of second-order points of any cubic polynomial is a spectrahedron, and conversely that any spectrahedron is the projection of the set of second-order points of a cubic polynomial. In our final section, we briefly present a potential application of finding local minima of cubic polynomials to the design of a third-order Newton method.
BibTeX:
@article{Ahmadi2020,
  author = {Amir Ali Ahmadi and Jeffrey Zhang},
  title = {Complexity aspects of local minima and related notions},
  year = {2020}
}
Ahrens P, Demmel J and Nguyen H (2020), "Algorithms for Efficient Reproducible Floating Point Summation", ACM Transactions on Mathematical Software. New York, NY, USA Vol. 0(ja) Association for Computing Machinery.
Abstract: We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, with nonassociativity of floating-point addition, makes reproducibility a challenge even for summing a vector of numbers, or the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible floating point sum, independent of summation order. The algorithm uses only a subset of IEEE Standard 754-2008. It is communication-optimal, i.e. it does just one pass over the data in the sequential case, or one reduction operation in parallel, requiring an accumulator of just 6 words (higher precision is possible). The arithmetic cost is 7n additions to sum n words, and the error bound can be up to 10^(-8) times smaller than for conventional summation. We describe the algorithm, the software infrastructure for reproducible BLAS (ReproBLAS), and performance results. For example, for the dot product of 4096 doubles, we get a 4× slowdown compared to Intel MKL on an Intel Core i7-2600 CPU operating at 3.4 GHz and 256 KB L2 Cache.
BibTeX:
@article{Ahrens2020,
  author = {Ahrens, Peter and Demmel, James and Nguyen, Hong},
  title = {Algorithms for Efficient Reproducible Floating Point Summation},
  journal = {ACM Transactions on Mathematical Software},
  publisher = {Association for Computing Machinery},
  year = {2020},
  volume = {0},
  number = {ja},
  url = {https://dl.acm.org/doi/abs/10.1145/3389360},
  doi = {10.1145/3389360}
}
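
ReproBLAS relies on a binned accumulator, but the flavour of error-controlled summation is easiest to show with Kahan's classical compensated sum, a much simpler relative; note that, unlike the paper's algorithm, it is still order-dependent and hence not reproducible under reordering.
function s = kahanSum(x)
% Compensated summation: the rounding error of each addition is carried
% in c and fed back, greatly reducing the accumulated error.
s = 0; c = 0;
for k = 1:numel(x)
    y = x(k) - c;
    t = s + y;
    c = (t - s) - y;   % the part of y lost in computing t
    s = t;
end
end
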
Ahrens P and Boman EG (2020), "On Optimal Partitioning For Sparse Matrices In Variable Block Row Format", May, 2020.
Abstract: The Variable Block Row (VBR) format is an influential blocked sparse matrix format designed to represent shared sparsity structure between adjacent rows and columns. VBR consists of groups of adjacent rows and columns, storing the resulting blocks that contain nonzeros in a dense format. This reduces the memory footprint and enables optimizations such as register blocking and instruction-level parallelism. Existing approaches use heuristics to determine which rows and columns should be grouped together. We adapt and optimize a dynamic programming algorithm for sequential hypergraph partitioning to produce a linear time algorithm which can determine the optimal partition of rows under an expressive cost model, assuming the column partition remains fixed. Furthermore, we show that the problem of determining an optimal partition for the rows and columns simultaneously is NP-Hard under a simple linear cost model. To evaluate our algorithm empirically against existing heuristics, we introduce the 1D-VBR format, a specialization of VBR format where columns are left ungrouped. We evaluate our algorithms on all 1626 real-valued matrices in the SuiteSparse Matrix Collection. When asked to minimize an empirically derived cost model for a sparse matrix-vector multiplication kernel, our algorithm produced partitions whose 1D-VBR realizations achieve a speedup of at least 1.18 over an unblocked kernel on 25% of the matrices, and a speedup of at least 1.59 on 12.5% of the matrices. The 1D-VBR representation produced by our algorithm had faster SpMVs than the 1D-VBR representations produced by any existing heuristics on 87.8% of the test matrices.
BibTeX:
@article{Ahrens2020a,
  author = {Peter Ahrens and Erik G. Boman},
  title = {On Optimal Partitioning For Sparse Matrices In Variable Block Row Format},
  year = {2020}
}
Ahrens P (2020), "Load Plus Communication Balancing in Contiguous Partitions for Distributed Sparse Matrices: Linear-Time Algorithms", July, 2020.
Abstract: We study partitioning to parallelize multiplication of one or more dense vectors by a sparse matrix (SpMV or SpMM). We consider contiguous partitions, where the rows (or columns) of the matrix are split into K parts without reordering. We present exact and approximate contiguous partitioning algorithms that minimize the runtime of the longest-running processor under cost models that combine work factors and hypergraph communication factors. This differs from traditional graph or hypergraph partitioning models which minimize total communication under a work balance constraint. We address regimes where partitions of the row space and column space are expected to match (the symmetric case) or are allowed to differ (the nonsymmetric case). Our algorithms use linear space. Our exact algorithm runs in linear time when K^2 is sublinear. Our (1 + ε)-approximate algorithm runs in linear time when K(1/ε) is sublinear. We combine concepts from high-performance computing and computational geometry. Existing load balancing algorithms optimize a linear model of per-processor work. We make minor adaptations to optimize arbitrary nonuniform monotonic increasing or decreasing cost functions which may be expensive to evaluate. We then show that evaluating our model of communication is equivalent to planar dominance counting. We specialize Chazelle's dominance counting algorithm to points in the bounded integer plane and generalize it to trade reduced construction time for increased query time, since our partitioners make very few queries. Our algorithms split the original row (or column) ordering into parts to optimize diverse cost models. Combined with reordering or embedding techniques, our algorithms might be used to build more general heuristic partitioners, as they can optimally round one-dimensional embeddings of direct K-way noncontiguous partitioning problems.
BibTeX:
@article{Ahrens2020b,
  author = {Peter Ahrens},
  title = {Load Plus Communication Balancing in Contiguous Partitions for Distributed Sparse Matrices: Linear-Time Algorithms},
  year = {2020}
}
AlAhmadi S, Muhammed T, Mehmood R and Albeshri A (2020), "Performance Characteristics for Sparse Matrix-Vector Multiplication on GPUs", In Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies. Cham , pp. 409-426. Springer International Publishing.
Abstract: The massive parallelism provided by the graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. One such application is Sparse Matrix-Vector (SpMV) multiplication, which is an essential building block for numerous scientific and engineering applications. Researchers who propose new storage techniques for sparse matrix-vector multiplication focus mainly on a single evaluation metric or performance characteristic, which is usually the throughput performance of sparse matrix-vector multiplication in FLOPS. However, such an evaluation does not provide deeper insight, nor does it allow new SpMV techniques to be compared directly with their competitors. In this chapter, we explain the notable performance characteristics of the GPU architectures and SpMV computations. We discuss various strategies to improve the performance of SpMV on GPUs. We also discuss a few performance criteria that are usually overlooked by the researchers during the evaluation process. We also analyze various well-known schemes such as COO, CSR, ELL, DIA, HYB, and CSR5 using the discussed performance characteristics.
BibTeX:
@inbook{AlAhmadi2020,
  author = {AlAhmadi, Sarah and Muhammed, Thaha and Mehmood, Rashid and Albeshri, Aiiad},
  editor = {Mehmood, Rashid and See, Simon and Katib, Iyad and Chlamtac, Imrich},
  title = {Performance Characteristics for Sparse Matrix-Vector Multiplication on GPUs},
  booktitle = {Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {409--426},
  doi = {10.1007/978-3-030-13705-2_17}
}
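
As a reference point for the storage formats the chapter compares, here is the CSR kernel spelled out; illustrative only, since MATLAB itself stores sparse matrices in compressed sparse column form.
function y = csrSpMV(rowPtr, colIdx, vals, x)
% Sparse matrix-vector product y = A*x with A in Compressed Sparse Row form.
n = numel(rowPtr) - 1;
y = zeros(n, 1);
for i = 1:n
    for k = rowPtr(i):rowPtr(i+1)-1        % nonzeros of row i
        y(i) = y(i) + vals(k) * x(colIdx(k));
    end
end
end
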
Alappat CL, Alvermann A, Basermann A, Fehske H, Futamura Y, Galgon M, Hager G, Huber S, Imakura A, Kawai M, Kreutzer M, Lang B, Nakajima K, Röhrig-Zöllner M, Sakurai T, Shahzad F, Thies J and Wellein G (2020), "ESSEX: Equipping Sparse Solvers For Exascale", In Software for Exascale Computing -- SPPEXA 2016--2019. , pp. 143-187. Springer International Publishing.
Abstract: The ESSEX project has investigated programming concepts, data structures, and numerical algorithms for scalable, efficient, and robust sparse eigenvalue solvers on future heterogeneous exascale systems. Starting without the burden of legacy code, a holistic performance engineering process could be deployed across the traditional software layers to identify efficient implementations and guide sustainable software development. At the basic building blocks level, a flexible MPI+X programming approach was implemented together with a new sparse data structure (SELL-C-σ) to support heterogeneous architectures by design. Furthermore, ESSEX focused on hardware-efficient kernels for all relevant architectures and efficient data structures for block vector formulations of the eigensolvers. The algorithm layer addressed standard, generalized, and nonlinear eigenvalue problems and provided some widely usable solver implementations including a block Jacobi--Davidson algorithm, contour-based integration schemes, and filter polynomial approaches. Adding to the highly efficient kernel implementations, algorithmic advances such as adaptive precision, optimized filtering coefficients, and preconditioning have further improved time to solution. These developments were guided by quantum physics applications, especially from the field of topological insulator- or graphene-based systems. For these, ScaMaC, a scalable matrix generation framework for a broad set of quantum physics problems, was developed. As the central software core of ESSEX, the PHIST library for sparse systems of linear equations and eigenvalue problems has been established. It abstracts algorithmic developments from low-level optimization. Finally, central ESSEX software components and solvers have demonstrated scalability and hardware efficiency on up to 256 K cores using million-way process/thread-level parallelism.
BibTeX:
@inproceedings{Alappat2020,
  author = {Alappat, Christie L. and Alvermann, Andreas and Basermann, Achim and Fehske, Holger and Futamura, Yasunori and Galgon, Martin and Hager, Georg and Huber, Sarah and Imakura, Akira and Kawai, Masatoshi and Kreutzer, Moritz and Lang, Bruno and Nakajima, Kengo and Röhrig-Zöllner, Melven and Sakurai, Tetsuya and Shahzad, Faisal and Thies, Jonas and Wellein, Gerhard},
  editor = {Bungartz, Hans-Joachim and Reiz, Severin and Uekermann, Benjamin and Neumann, Philipp and Nagel, Wolfgang E.},
  title = {ESSEX: Equipping Sparse Solvers For Exascale},
  booktitle = {Software for Exascale Computing -- SPPEXA 2016--2019},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {143--187}
}
Alghunaim SA (2020), "On the Performance and Linear Convergence of Decentralized Primal-Dual Methods". Thesis at: University of California Los Angeles.
Abstract: This dissertation studies the performance and linear convergence properties of primal-dual methods for the solution of decentralized multi-agent optimization problems. Decentralized multi-agent optimization is a powerful paradigm that finds applications in diverse fields in learning and engineering design. In these setups, a network of agents is connected through some topology and agents are allowed to share information only locally. Their overall goal is to seek the minimizer of a global optimization problem through localized interactions. In decentralized consensus problems, the agents are coupled through a common consensus variable that they need to agree upon, while in decentralized resource allocation problems, the agents are coupled through global affine constraints. Various decentralized consensus optimization algorithms already exist in the literature. Some methods are derived from a primal-dual perspective, while other methods are derived as gradient tracking mechanisms meant to track the average of local gradients. Among the gradient tracking methods are the adapt-then-combine implementations motivated by diffusion strategies, which have been observed to perform better than other implementations. In this dissertation, we develop a novel adapt-then-combine primal-dual algorithmic framework that captures most state-of-the-art gradient based methods as special cases including all the variations of the gradient-tracking methods. We also develop a concise and novel analysis technique that establishes the linear convergence of this general framework under strongly convex objectives. Due to our unified framework, the analysis reveals important characteristics for these methods such as their convergence rates and step-size stability ranges. Moreover, the analysis reveals how the augmented Lagrangian penalty term, which is utilized in most of these methods, affects the performance of decentralized algorithms. Another important question that we answer is whether decentralized proximal gradient methods can achieve global linear convergence for non-smooth composite optimization. For centralized algorithms, linear convergence has been established in the presence of a non-smooth composite term. In this dissertation, we close the gap between centralized and decentralized proximal gradient algorithms and show that decentralized proximal algorithms can also achieve linear convergence in the presence of a non-smooth term. Furthermore, we show that when each agent possesses a different local non-smooth term then global linear convergence cannot be established in the worst case. Most works that study decentralized optimization problems assume that all agents are involved in computing all variables. However, in many applications the coupling across agents is sparse in the sense that only a few agents are involved in computing certain variables. We show how to design decentralized algorithms in sparsely coupled consensus and resource allocation problems. More importantly, we establish analytically the importance of exploiting the sparsity structure in coupled large-scale networks.
BibTeX:
@phdthesis{Alghunaim2020,
  author = {Alghunaim, Sulaiman A.},
  title = {On the Performance and Linear Convergence of Decentralized Primal-Dual Methods},
  school = {University of California Los Angeles},
  year = {2020}
}
Al-Harthi N, A.Alomairy RM, Akbudak K, Chen R, Ltaief H, Bagci H and Keyes DE (2020), "Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization", In Proceedings of ISC High Performance 2020., June, 2020.
Abstract: We design and develop a new high performance implementation of a fast direct LU-based solver using low-rank approximations on massively parallel systems. The LU factorization is the most time-consuming step in solving systems of linear equations in the context of analyzing acoustic scattering from large 3D objects. The matrix equation is obtained by discretizing the boundary integral of the exterior Helmholtz problem using a higher-order Nyström scheme. The main idea is to exploit the inherent data sparsity of the matrix operator by performing local tile-centric approximations while still capturing the most significant information. In particular, the proposed LU-based solver leverages the Tile Low-Rank (TLR) data compression format as implemented in the Hierarchical Computations on Manycore Architectures (HiCMA) library to decrease the complexity of “classical” dense direct solvers from cubic to quadratic order. We taskify the underlying boundary integral kernels to expose fine-grained computations. We then employ the dynamic runtime system StarPU to orchestrate the scheduling of computational tasks on shared and distributed-memory systems. The resulting asynchronous execution permits to compensate for the load imbalance due to the heterogeneous ranks, while mitigating the overhead of data motion. We assess the robustness of our TLR LU-based solver and study the qualitative impact when using different numerical accuracies. The new TLR LU factorization outperforms the state-of-the-art dense factorizations by up to an order of magnitude on various parallel systems, for analysis of scattering from large-scale 3D synthetic and real geometries.
BibTeX:
@inproceedings{AlHarthi2020,
  author = {Al-Harthi, Noha and A.Alomairy, Rabab M. and Akbudak, Kadir and Chen, Rui and Ltaief, Hatem and Bagci, Hakan and Keyes, David E.},
  title = {Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization},
  booktitle = {Proceedings of ISC High Performance 2020},
  year = {2020},
  url = {http://hdl.handle.net/10754/663212}
}
Aliaga JI, Anzt H, Grützmacher T, Quintana-Ortí ES and Tomás AE (2020), "Compressed Basis GMRES on High Performance GPUs", September, 2020.
Abstract: Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in all current computer architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solver that advocates for decoupling the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory access, the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a lower volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating point as well as fixed point formats with little impact on the convergence of the iterative process. We develop a high performance implementation of the "compressed basis GMRES" solver in the Ginkgo sparse linear algebra library and using a large set of test problems from the SuiteSparse matrix collection we demonstrate robustness and performance advantages on a modern NVIDIA V100 GPU of up to 50% over the standard GMRES solver that stores all data in IEEE double precision.
BibTeX:
@article{Aliaga2020,
  author = {José I. Aliaga and Hartwig Anzt and Thomas Grützmacher and Enrique S. Quintana-Ortí and Andrés E. Tomás},
  title = {Compressed Basis GMRES on High Performance GPUs},
  year = {2020}
}
Alimisis F, Orvieto A, Bécigneul G and Lucchi A (2020), "Practical Accelerated Optimization on Riemannian Manifolds", February, 2020.
Abstract: We develop a new Riemannian descent algorithm with an accelerated rate of convergence. We focus on functions that are geodesically convex or weakly-quasi-convex, which are weaker function classes compared to prior work that has considered geodesically strongly convex functions. Our proof of convergence relies on a novel estimate sequence which allows to demonstrate the dependency of the convergence rate on the curvature of the manifold. We validate our theoretical results empirically on several optimization problems defined on a sphere and on the manifold of positive definite matrices.
BibTeX:
@article{Alimisis2020,
  author = {Foivos Alimisis and Antonio Orvieto and Gary Bécigneul and Aurelien Lucchi},
  title = {Practical Accelerated Optimization on Riemannian Manifolds},
  year = {2020}
}
Alimo R, Beyhaghi P and Bewley TR (2020), "Delaunay-based derivative-free optimization via global surrogates. Part III: nonconvex constraints", Journal of Global Optimization., 1, 2020.
Abstract: This paper introduces a Delaunay-based derivative-free optimization algorithm, dubbed Δ-DOGS(Ω), for problems with both (a) a nonconvex, computationally expensive objective function f(x), and (b) nonlinear, computationally expensive constraint functions c_l(x) which, taken together, define a nonconvex, possibly even disconnected feasible domain Ω which is assumed to lie within a known rectangular search domain Ω_s, everywhere within which the f(x) and c_l(x) may be evaluated. Approximations of both the objective function f(x) as well as the feasible domain Ω are developed and refined as the iterations proceed. The approach is practically limited to the problems with less than about ten adjustable parameters. The work is an extension of our original Delaunay-based optimization algorithm (see JOGO DOI: 10.1007/s10898-015-0384-2), and inherits many of the constructions and strengths of that algorithm, including: (1) a surrogate function p(x) interpolating all existing function evaluations and summarizing their trends, (2) a synthetic, piecewise-quadratic uncertainty function e(x) built on the framework of a Delaunay triangulation amongst existing datapoints, (3) a tunable balance between global exploration (large K) and local refinement (small K), (4) provable global convergence for a sufficiently large K, under the assumption that the objective and constraint functions are twice differentiable with bounded Hessians, (5) an Adaptive-K variant of the algorithm that efficiently tunes K automatically based on a target value of the objective function, and (6) remarkably fast global convergence on a variety of benchmark problems.
BibTeX:
@article{Alimo2020,
  author = {Alimo, Ryan and Beyhaghi, Pooriya and Bewley, Thomas R.},
  title = {Delaunay-based derivative-free optimization via global surrogates. Part III: nonconvex constraints},
  journal = {Journal of Global Optimization},
  year = {2020},
  doi = {10.1007/s10898-019-00854-2}
}
Allman A and Zhang Q (2020), "Branch-and-Price for a Class of Nonconvex Mixed-Integer Nonlinear Programs", January, 2020.
Abstract: This work attempts to combine the strengths of two major technologies that have matured over the last three decades: global mixed-integer nonlinear optimization and branch-and-price. We consider a class of generally nonconvex mixed-integer nonlinear programs (MINLPs) with linear complicating constraints and integer linking variables. If the complicating constraints are removed, the problem becomes easy to solve, e.g. due to decomposable structure. Integrality of the linking variables allows us to apply a discretization approach to derive a Dantzig-Wolfe reformulation and solve the problem to global optimality using branch-and-price. It is a remarkably simple idea; but to our surprise, it has barely found any application in the literature. In this work, we show that many relevant problems directly fall or can be reformulated into this class of MINLPs. We present the branch-and-price algorithm and demonstrate its effectiveness (and sometimes ineffectiveness) in an extensive computational study considering multiple large-scale problems of practical relevance, showing that, in many cases, orders-of-magnitude reductions in solution time can be achieved.
BibTeX:
@article{Allman2020,
  author = {Andrew Allman and Qi Zhang},
  title = {Branch-and-Price for a Class of Nonconvex Mixed-Integer Nonlinear Programs},
  year = {2020}
}
Altschuler JM and Parrilo PA (2020), "Random Osborne: a simple, practical algorithm for Matrix Balancing in near-linear time", April, 2020.
Abstract: We revisit Matrix Balancing, a pre-conditioning task used ubiquitously for computing eigenvalues and matrix exponentials. Since 1960, Osborne's algorithm has been the practitioners' algorithm of choice, and is now implemented in most numerical software packages. However, the theoretical properties of Osborne's algorithm are not well understood. Here, we show that a simple random variant of Osborne's algorithm converges in near-linear time in the input sparsity. Specifically, it balances K ∈ ℝ_≥0^(n×n) after O(mε^(-2) log κ) arithmetic operations, where m is the number of nonzeros in K, ε is the ℓ_1 accuracy, and κ = Σ_ij K_ij / (min_ij: K_ij≠0 K_ij) measures the conditioning of K. Previous work had established near-linear runtimes either only for ℓ_2 accuracy (a weaker criterion which is less relevant for applications), or through an entirely different algorithm based on (currently) impractical Laplacian solvers. We further show that if the graph with adjacency matrix K is moderately connected--e.g., if K has at least one positive row/column pair--then Osborne's algorithm initially converges exponentially fast, yielding an improved runtime O(mε^(-1) log κ). We also address numerical precision issues by showing that these runtime bounds still hold when using O(log(nκ/ε))-bit numbers. Our results are established through a potential argument that leverages a convex optimization perspective of Osborne's algorithm, and relates the per-iteration progress to the current imbalance as measured in Hellinger distance. Unlike previous analyses, we critically exploit log-convexity of the potential. Our analysis extends to other variants of Osborne's algorithm: along the way, we establish significantly improved runtime bounds for cyclic, greedy, and parallelized variants.
BibTeX:
@article{Altschuler2020,
  author = {Jason M. Altschuler and Pablo A. Parrilo},
  title = {Random Osborne: a simple, practical algorithm for Matrix Balancing in near-linear time},
  year = {2020}
}
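
The cyclic variant that the authors analyse is itself only a few lines; a sketch for a nonnegative matrix, scaling each row/column pair to equal 2-norm:
function [K, d] = osborneBalance(K, numSweeps)
% Cyclic Osborne balancing: find d so that diag(d)*K*diag(1./d) has
% each row norm equal to the matching column norm.
n = size(K, 1);
d = ones(n, 1);
for sweep = 1:numSweeps
    for i = 1:n
        r = norm(K(i, :)); c = norm(K(:, i));
        if r > 0 && c > 0
            a = sqrt(c / r);       % equalises the norms of row i and column i
            K(i, :) = a * K(i, :);
            K(:, i) = K(:, i) / a;
            d(i) = d(i) * a;
        end
    end
end
end
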
Altschuler JM and Parrilo PA (2020), "Approximating Min-Mean-Cycle for low-diameter graphs in near-optimal time and memory", April, 2020.
Abstract: We revisit Min-Mean-Cycle, the classical problem of finding a cycle in a weighted directed graph with minimum mean weight. Despite an extensive algorithmic literature, previous work falls short of a near-linear runtime in the number of edges m--in fact, there is a natural barrier which precludes such a runtime for solving Min-Mean-Cycle exactly. Here, we give a much faster approximation algorithm that, for graphs with polylogarithmic diameter, has near-linear runtime. In particular, this is the first algorithm whose runtime for the complete graph scales in the number of vertices n as O(n^2). Moreover--unconditionally on the diameter--the algorithm uses only O(n) memory beyond reading the input, making it "memory-optimal". The algorithm is also simple to implement and has remarkable practical performance. Our approach is based on solving a linear programming (LP) relaxation using entropic regularization, which effectively reduces the LP to a Matrix Balancing problem--a la the popular reduction of Optimal Transport to Matrix Scaling. We then round the fractional LP solution using a variant of the classical Cycle-Cancelling algorithm that is sped up to near-linear runtime at the expense of being approximate, and implemented in a memory-optimal manner. We also provide an alternative algorithm with slightly faster theoretical runtime, albeit worse memory usage and practicality. This algorithm uses the same rounding procedure, but solves the LP relaxation by leveraging recent developments in area-convexity regularization. Its runtime scales inversely in the approximation accuracy, which we show is optimal--barring a major breakthrough in algorithmic graph theory, namely faster Shortest Paths algorithms.
BibTeX:
@article{Altschuler2020a,
  author = {Jason M. Altschuler and Pablo A. Parrilo},
  title = {Approximating Min-Mean-Cycle for low-diameter graphs in near-optimal time and memory},
  year = {2020}
}
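
For contrast, the exact classical baseline is Karp's O(nm) dynamic program; a dense-matrix sketch, with Inf marking missing edges and every vertex assumed reachable from vertex 1:
function mu = karpMinMeanCycle(W)
% Karp's algorithm: D(k+1, v) is the minimum weight of a walk with exactly
% k edges from vertex 1 to v; mu = min_v max_k (D_n(v)-D_k(v))/(n-k).
n = size(W, 1);
D = inf(n+1, n);
D(1, 1) = 0;
for k = 1:n
    for v = 1:n
        D(k+1, v) = min(D(k, :)' + W(:, v));
    end
end
ratios = (D(n+1, :) - D(1:n, :)) ./ (n - (0:n-1)');
mu = min(max(ratios, [], 1));
end
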
Alyahya H, Mehmood R and Katib I (2020), "Parallel Iterative Solution of Large Sparse Linear Equation Systems on the Intel MIC Architecture", In Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies. Cham , pp. 377-407. Springer International Publishing.
Abstract: Many important scientific, engineering, and smart city applications require solving large sparse linear equation systems. The numerical methods for solving linear equations can be categorised into direct methods and iterative methods. Jacobi method is one of the iterative solvers that has been widely used due to its simplicity and efficiency. Its performance is affected by factors including the storage format, the specific computational algorithm, and its implementation. While the performance of Jacobi has been studied extensively on conventional CPU architectures, research on its performance on emerging architectures, such as the Intel Many Integrated Core (MIC) architecture, is still in its infancy. In this chapter, we investigate the performance of parallel implementations of the Jacobi method on Knights Corner (KNC), the first generation of the Intel MIC architectures. We implement Jacobi with two storage formats, Compressed Sparse Row (CSR) and Modified Sparse Row (MSR), and measure their performance in terms of execution time, offloading time, and speedup. We report results of sparse matrices with over 28 million rows and 640 million non-zero elements acquired from 13 diverse application domains. The experimental results show that our Jacobi parallel implementation on MIC achieves speedups of up to 27.75× compared to the sequential implementation. It also delivers a speedup of up to 3.81× compared to a powerful node comprising 24 cores in two Intel Xeon E5-2695v2 processors.
BibTeX:
@inbook{Alyahya2020,
  author = {Alyahya, Hana and Mehmood, Rashid and Katib, Iyad},
  editor = {Mehmood, Rashid and See, Simon and Katib, Iyad and Chlamtac, Imrich},
  title = {Parallel Iterative Solution of Large Sparse Linear Equation Systems on the Intel MIC Architecture},
  booktitle = {Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {377--407},
  doi = {10.1007/978-3-030-13705-2_16}
}
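The serial update that the chapter parallelizes is easy to state. A bare-bones MATLAB sketch on a toy Poisson matrix of my own (nothing like the chapter's MIC-optimized CSR/MSR kernels, where the SpMV in the residual is the expensive part):
A = gallery('poisson', 10);             % small SPD, diagonally dominant test matrix
b = ones(size(A,1), 1);
dA = full(diag(A));                     % Jacobi only needs the diagonal of A
x = zeros(size(b));
for it = 1:2000
    r = b - A*x;                        % residual; the SpMV here dominates the cost
    if norm(r) <= 1e-8 * norm(b), break; end
    x = x + r ./ dA;                    % x_{k+1} = x_k + D^{-1} (b - A*x_k)
end
fprintf('Jacobi stopped after %d iterations, relative residual %.1e\n', it, norm(b - A*x)/norm(b));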
Amaral VS, Andreani R, Birgin EG, Marcondes DS and Martínez JM (2020), "On complexity and convergence of high-order coordinate descent algorithms", September, 2020.
Abstract: Coordinate descent methods with high-order regularized models for box-constrained minimization are introduced. High-order stationarity asymptotic convergence and first-order stationarity worst-case evaluation complexity bounds are established. The computer work that is necessary for obtaining first-order ε-stationarity with respect to the variables of each coordinate-descent block is O(ε^(-(p+1)/p)), whereas the computer work for getting first-order ε-stationarity with respect to all the variables simultaneously is O(ε^(-(p+1))). Numerical examples involving multidimensional scaling problems are presented. The numerical performance of the methods is enhanced by means of coordinate-descent strategies for choosing initial points.
BibTeX:
@article{Amaral2020,
  author = {V. S. Amaral and R. Andreani and E. G. Birgin and D. S. Marcondes and J. M. Martínez},
  title = {On complexity and convergence of high-order coordinate descent algorithms},
  year = {2020}
}
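The paper's methods use high-order (p-th order) regularized models, which don't fit in a few lines; as a plain first-order stand-in, here is cyclic coordinate descent with exact coordinate minimization and projection onto the box, on a toy quadratic of my own:
rng(0);
n = 20;
M = randn(n); Q = M'*M + n*eye(n);      % SPD Hessian of f(x) = 0.5*x'*Q*x - c'*x
c = randn(n,1);
lo = -ones(n,1); up = ones(n,1);        % box constraints lo <= x <= up
x = zeros(n,1);
for sweep = 1:200
    for i = 1:n
        % exact minimizer along coordinate i, then projection onto [lo(i), up(i)]
        xi = (c(i) - Q(i,:)*x + Q(i,i)*x(i)) / Q(i,i);
        x(i) = min(max(xi, lo(i)), up(i));
    end
end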
Anaissi A, Suleiman B and Zandavi SM (2020), "NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent", March, 2020.
Abstract: Multi-way data analysis has become an essential tool for capturing underlying structures in higher-order datasets stored in a tensor X ∊ ℝ^(I_1 × ⋯ × I_N). CANDECOMP/PARAFAC (CP) decomposition has been extensively studied and applied to approximate X by N loading matrices A^(1), …, A^(N), where N represents the order of the tensor. We propose a new efficient CP decomposition solver named NeCPD for the non-convex problem in multi-way online data, based on the stochastic gradient descent (SGD) algorithm. SGD is very useful in the online setting since it allows us to update X^(t+1) in a single step. In terms of global convergence, it is well known that SGD can get stuck at many saddle points when it deals with non-convex problems. We study the Hessian matrix to identify these saddle points, and then try to escape them using a perturbation approach which adds a little noise to the gradient update step. We further apply Nesterov's Accelerated Gradient (NAG) method in the SGD algorithm to optimally accelerate the convergence rate and compensate for the Hessian computation delay per epoch. Experimental evaluation in the field of structural health monitoring, using laboratory-based and real-life structural datasets, shows that our method provides more accurate results compared with existing online tensor analysis methods.
BibTeX:
@article{Anaissi2020,
  author = {Ali Anaissi and Basem Suleiman and Seid Miad Zandavi},
  title = {NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent},
  year = {2020}
}
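The two ingredients the abstract combines, a Nesterov lookahead plus a small perturbation of the gradient step to escape saddle points, fit in a few lines. A toy sketch of my own on a function with a saddle at the origin, not the CP-decomposition objective:
rng(1);
gradf = @(x) [x(1)^3 - x(1); x(2)];     % gradient of f(x) = x1^4/4 - x1^2/2 + x2^2/2; saddle at 0
x = zeros(2,1); v = zeros(2,1);
eta = 0.1; mu = 0.9; noise = 1e-2;
for t = 1:500
    g = gradf(x + mu*v) + noise*randn(2,1);   % NAG lookahead gradient + saddle-escaping noise
    v = mu*v - eta*g;
    x = x + v;
end
disp(x')                                % ends near one of the minima at x1 = +-1, x2 = 0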
Andrei N (2020), "Nonlinear Conjugate Gradient Methods for Unconstrained Optimization" Springer International Publishing.
Abstract: Two approaches are known for solving large-scale unconstrained optimization problems—the limited-memory quasi-Newton method (truncated Newton method) and the conjugate gradient method. This is the first book to detail conjugate gradient methods, showing their properties and convergence characteristics as well as their performance in solving large-scale unconstrained optimization problems and applications. Comparisons to the limited-memory and truncated Newton methods are also discussed. Topics studied in detail include: linear conjugate gradient methods, standard conjugate gradient methods, acceleration of conjugate gradient methods, hybrid methods, modifications of the standard scheme, memoryless BFGS-preconditioned methods, and three-term methods. Other conjugate gradient methods, based on clustering the eigenvalues or on minimizing the condition number of the iteration matrix, are also treated. For each method, the convergence analysis, the computational performance and the comparisons versus other conjugate gradient methods are given. The theory behind the conjugate gradient algorithms presented as a methodology is developed with a clear, rigorous, and friendly exposition; the reader will gain an understanding of their properties and their convergence and will learn to develop and prove the convergence of their own methods. Numerous numerical studies are supplied with comparisons and comments on the behavior of conjugate gradient algorithms for solving a collection of 800 unconstrained optimization problems of different structures and complexities, with the number of variables in the range [1000, 10000]. The book is addressed to all those interested in developing and using new advanced techniques for solving complex unconstrained optimization problems. Mathematical programming researchers, theoreticians and practitioners in operations research, practitioners in engineering and industry, as well as graduate students in mathematics, Ph.D. and master students in mathematical programming, will find plenty of information and practical applications for solving large-scale unconstrained optimization problems by conjugate gradient methods.
BibTeX:
@book{Andrei2020,
  author = {Neculai Andrei},
  title = {Nonlinear Conjugate Gradient Methods for Unconstrained Optimization},
  publisher = {Springer International Publishing},
  year = {2020},
  doi = {10.1007/978-3-030-42950-8}
}
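As a reminder of the basic scheme the book builds on, here is a short Polak-Ribière+ nonlinear CG with Armijo backtracking on the Rosenbrock function. A generic textbook sketch of my own, not one of the book's methods:
f = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;               % Rosenbrock test function
g = @(x) [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1)); 200*(x(2) - x(1)^2)];
x = [-1.2; 1]; gk = g(x); d = -gk;
for k = 1:5000
    if gk'*d >= 0, d = -gk; end          % safeguard: fall back to steepest descent
    t = 1;                               % Armijo backtracking line search
    while f(x + t*d) > f(x) + 1e-4 * t * (gk'*d)
        t = t/2;
    end
    x = x + t*d;
    gnew = g(x);
    beta = max((gnew'*(gnew - gk)) / (gk'*gk), 0);   % Polak-Ribiere+ (automatic restart)
    d = -gnew + beta*d;
    gk = gnew;
    if norm(gk) < 1e-6, break; end
end
fprintf('solution [%g %g] after %d iterations\n', x(1), x(2), k);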
Angerd A (2020), "Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units". Thesis at: Division of Computer Engineering, Department of Computer Science & Engineering, Chalmers University of Technology.
Abstract: A key challenge in modern computing systems is to access data fast enough to fully utilize the computing elements in the chip. In Graphics Processing Units (GPUs), the performance is often constrained by register file size, memory bandwidth, and the capacity of the main memory. One important technique towards alleviating this challenge is data compression. By reducing the amount of data that needs to be communicated or stored, memory resources crucial for performance can be efficiently utilized. This thesis provides a set of approximation and compression techniques for GPUs, with the goal of efficiently utilizing the computational fabric, and thereby increasing performance. The thesis shows that these techniques can substantially lower the amount of information the system has to process, and are thus important tools in the process of meeting challenges in memory utilization. This thesis makes contributions within three areas: controlled floating-point precision reduction, lossless and lossy memory compression, and distributed training of neural networks. In the first area, the thesis shows that through automated and controlled floating-point approximation, the register file can be more efficiently utilized. This is achieved through a framework which establishes a cross-layer connection between the application and the microarchitecture layer, and a novel register file organization capable of leveraging low-precision floating-point values and narrow integers for increased capacity and performance. Within the area of compression, this thesis aims at increasing the effective bandwidth of GPUs by presenting a lossless and lossy memory compression algorithm to reduce the amount of transferred data. In contrast to state-of-the-art compression techniques such as Base-Delta-Immediate and Bitplane Compression, which use intra-block bases for compression, the proposed algorithm leverages multiple global base values to reach a higher compression ratio. The algorithm includes an optional approximation step for floating-point values which offers a higher compression ratio at a given, low, error rate. Finally, within the area of distributed training of neural networks, this thesis proposes a subgraph approximation scheme for graph data which mitigates accuracy loss in a distributed setting. The scheme allows neural network models that use graphs as inputs to converge at single-machine accuracy, while minimizing synchronization overhead between the machines.
BibTeX:
@phdthesis{Angerd2020,
  author = {Alexandra Angerd},
  title = {Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units},
  school = {Division of Computer Engineering, Department of Computer Science & Engineering, Chalmers University of Technology},
  year = {2020},
  url = {https://research.chalmers.se/publication/521610/file/521610_Fulltext.pdf}
}
Angriman E, Predari M, van der Grinten A and Meyerhenke H (2020), "Approximation of the Diagonal of a Laplacian's Pseudoinverse for Complex Network Analysis", June, 2020.
Abstract: The ubiquity of massive graph data sets in numerous applications requires fast algorithms for extracting knowledge from these data. We are motivated here by three electrical measures for the analysis of large small-world graphs G = (V, E) -- i.e., graphs with diameter in O(log |V|), which are abundant in complex network analysis. From a computational point of view, the three measures have in common that their crucial component is the diagonal of the graph Laplacian's pseudoinverse, L†. Computing diag(L†) exactly by pseudoinversion, however, is as expensive as dense matrix multiplication -- and the standard tools in practice even require cubic time. Moreover, the pseudoinverse requires quadratic space -- hardly feasible for large graphs. Resorting to approximation by, e.g., using the Johnson-Lindenstrauss transform, requires the solution of O(log |V| / ε²) Laplacian linear systems to guarantee a relative error, which is still very expensive for large inputs. In this paper, we present a novel approximation algorithm that requires the solution of only one Laplacian linear system. The remaining parts are purely combinatorial -- mainly sampling uniform spanning trees, which we relate to diag(L†) via effective resistances. For small-world networks, our algorithm obtains a ±ε-approximation with high probability, in time that is nearly linear in |E| and quadratic in 1/ε. Another positive aspect of our algorithm is its parallel nature due to independent sampling. We thus provide two parallel implementations of our algorithm: one using OpenMP, one MPI + OpenMP. In our experiments against the state of the art, our algorithm (i) yields more accurate results, (ii) is much faster and more memory-efficient, and (iii) obtains good parallel speedups, in particular in the distributed setting.
BibTeX:
@article{Angriman2020,
  author = {Eugenio Angriman and Maria Predari and Alexander van der Grinten and Henning Meyerhenke},
  title = {Approximation of the Diagonal of a Laplacian's Pseudoinverse for Complex Network Analysis},
  year = {2020}
}
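For small graphs the quantity in question can be computed exactly by dense pseudoinversion, which is exactly the O(n^3)-time, O(n^2)-memory baseline the paper's spanning-tree sampling avoids. A sketch on a toy graph of my own, including the effective-resistance formula the paper exploits:
n = 6;
Edges = [1 2; 2 3; 3 4; 4 5; 5 6; 6 1; 2 5];   % a 6-cycle plus one chord
A = sparse(Edges(:,1), Edges(:,2), 1, n, n); A = A + A';
L = diag(sum(A,2)) - A;                  % graph Laplacian
Ldag = pinv(full(L));                    % exact pseudoinverse: fine here, infeasible at scale
diagLdag = diag(Ldag)                    % the quantity the paper approximates
Reff = @(u,v) Ldag(u,u) + Ldag(v,v) - 2*Ldag(u,v);   % effective resistance from L^dagger
Reff(1, 4)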
Anikin A, Dorn Y and Nesterov Y (2020), "Computational Methods for the Stable Dynamic Model", In Communications in Computer and Information Science. , pp. 280-294. Springer International Publishing.
Abstract: The traffic assignment problem is one of the central problems in transportation science. Various model assumptions lead to different setups corresponding to nonlinear optimization problems. In this work, we focus on the stable dynamic model and its generalizations. We propose a new equivalent representation for the stable dynamic model [Nesterov and de Palma, 2003]. We use a smoothing technique to derive a new model, which can be interpreted as a stochastic equilibrium model.
BibTeX:
@incollection{Anikin2020,
  author = {Anton Anikin and Yuriy Dorn and Yurii Nesterov},
  title = {Computational Methods for the Stable Dynamic Model},
  booktitle = {Communications in Computer and Information Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {280--294},
  doi = {10.1007/978-3-030-38603-0_21}
}
Antonakopoulos K, Belmega EV and Mertikopoulos P (2020), "Online and Stochastic Optimization beyond Lipschitz Continuity: A Riemannian Approach", In Proceedings of the 8th International Conference on Learning Representations.
Abstract: Motivated by applications to machine learning and imaging science, we study a class of online and stochastic optimization problems with loss functions that are not Lipschitz continuous; in particular, the loss functions encountered by the optimizer could exhibit gradient singularities or be singular themselves. Drawing on tools and techniques from Riemannian geometry, we examine a Riemann–Lipschitz (RL) continuity condition which is tailored to the singularity landscape of the problem's loss functions. In this way, we are able to tackle cases beyond the Lipschitz framework provided by a global norm, and we derive optimal regret bounds and last iterate convergence results through the use of regularized learning methods (such as online mirror descent). These results are subsequently validated in a class of stochastic Poisson inverse problems that arise in imaging science.
BibTeX:
@inproceedings{Antonakopoulos2020,
  author = {Kimon Antonakopoulos and E. Veronica Belmega and Panayotis Mertikopoulos},
  title = {Online and Stochastic Optimization beyond Lipschitz Continuity: A Riemannian Approach},
  booktitle = {Proceedings of the 8th International Conference on Learning Representations},
  year = {2020}
}
Anwer AR, Li G, Pattabiraman K, Sullivan M, Tsai T and Hari SKS (2020), "GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs", Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis.
Abstract: Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT, to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and memory architectures that differ from those of CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is two orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.
BibTeX:
@article{Anwer2020,
  author = {Abdul Rehman Anwer and Guanpeng Li and Karthik Pattabiraman and Michael Sullivan and Timothy Tsai and Siva Kumar Sastry Hari},
  title = {GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs},
  journal = {Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis},
  year = {2020},
  url = {http://blogs.ubc.ca/karthik/files/2020/08/SC2020-final.pdf}
}
Anzt H, Boman E, Falgout R, Ghysels P, Heroux M, Li X, McInnes LC, Mills RT, Rajamanickam S, Rupp K, Smith B, Yamazaki I and Yang UM (2020), "Preparing sparse solvers for exascale computing", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190053. The Royal Society.
Abstract: Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges.
BibTeX:
@article{Anzt2020,
  author = {Hartwig Anzt and Erik Boman and Rob Falgout and Pieter Ghysels and Michael Heroux and Xiaoye Li and Lois Curfman McInnes and Richard Tran Mills and Sivasankaran Rajamanickam and Karl Rupp and Barry Smith and Ichitaro Yamazaki and Ulrike Meier Yang},
  title = {Preparing sparse solvers for exascale computing},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190053},
  doi = {10.1098/rsta.2019.0053}
}
Anzt H, Cojean T, Yen-Chen C, Dongarra J, Flegar G, Nayak P, Tomov S, Tsai YM and Wang W (2020), "Load-balancing Sparse Matrix Vector Product Kernels on GPUs", ACM Transactions on Parallel Computing., 3, 2020. Vol. 7(1), pp. 1-26. Association for Computing Machinery (ACM).
Abstract: Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL and COO parts is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the SuiteSparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA's cuSPARSE library and a heavily tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
BibTeX:
@article{Anzt2020a,
  author = {Hartwig Anzt and Terry Cojean and Chen Yen-Chen and Jack Dongarra and Goran Flegar and Pratik Nayak and Stanimire Tomov and Yuhsiang M. Tsai and Weichung Wang},
  title = {Load-balancing Sparse Matrix Vector Product Kernels on GPUs},
  journal = {ACM Transactions on Parallel Computing},
  publisher = {Association for Computing Machinery (ACM)},
  year = {2020},
  volume = {7},
  number = {1},
  pages = {1--26},
  doi = {10.1145/3380930}
}
Anzt H, Cojean T, Flegar G, Göbel F, Grützmacher T, Nayak P, Ribizel T, Tsai YM and Quintana-Ortí ES (2020), "Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing", June, 2020.
Abstract: In this paper, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo's design principle abstracts all functionality as "linear operators", motivating the notion of a "linear operator algebra library". Ginkgo's current focus is oriented towards providing sparse linear algebra functionality for high performance GPU architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific back ends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo's usability by providing examples on how to use its functionality inside the MFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo's high performance on state-of-the-art GPU architectures.
BibTeX:
@article{Anzt2020b,
  author = {Hartwig Anzt and Terry Cojean and Goran Flegar and Fritz Göbel and Thomas Grützmacher and Pratik Nayak and Tobias Ribizel and Yuhsiang Mike Tsai and Enrique S. Quintana-Ortí},
  title = {Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing},
  year = {2020}
}
Anzt H, Cojean T, Chen Y-C, Flegar G, Göbel F, Grützmacher T, Nayak P, Ribizel T and Tsai Y-H (2020), "Ginkgo: A high performance numerical linear algebra library", The Journal of Open Source Software.
Abstract: Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures, with a high level of performance portability and a focus on software sustainability. The library focuses on solving sparse linear systems and accommodates a large variety of matrix formats, state-of-the-art iterative (Krylov) solvers and preconditioners, which make the library suitable for a variety of scientific applications. Ginkgo supports many architectures such as multi-threaded CPU, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead. Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence there has been a significant amount of effort in this direction with software libraries such as UMFPACK and CHOLMOD (“Suitesparse,” 2020) for solving linear systems with direct methods and PETSc (“PETSc,” 2020), Trilinos (“Trilinos,” 2020), Eigen (“Eigen,” 2020) and many more to solve linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide very efficient low level kernels optimized for different architectures and separate these kernels from the algorithms, thereby ensuring extensibility and ease of use. Ginkgo is also a part of the xSDK effort (“xSDK,” 2020) and available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for and interoperability between a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using High Performance Computing. Within this effort, we provide interoperability with application libraries such as deal.ii (Arndt et al., 2019) and mfem (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of the features of Ginkgo.
BibTeX:
@article{Anzt2020c,
  author = {Hartwig Anzt and Terry Cojean and Yen-Chen Chen and Goran Flegar and Fritz Göbel and Thomas Grützmacher and Pratik Nayak and Tobias Ribizel and Yu-Hsiang Tsai},
  title = {Ginkgo: A high performance numerical linear algebra library},
  journal = {The Journal of Open Source Software},
  year = {2020},
  doi = {10.21105/joss.02260}
}
Anzt H, Tsai YM, Abdelfattah A, Cojean T and Dongarra J (2020), "Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse and Batched Computations", In Proceedings of the 2020 IEEE/ACM Conference on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems.
Abstract: GPU accelerators have become an important backbone for scientific high-performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA's newest server-line GPU, the A100 architecture, part of the Ampere generation. Specifically, we assess its performance for sparse and batched computations, as these routines are relied upon in many scientific applications, and compare to the performance achieved on NVIDIA's previous server-line GPU.
BibTeX:
@inproceedings{Anzt2020d,
  author = {Hartwig Anzt and Yuhsiang M. Tsai and Ahmad Abdelfattah and Terry Cojean and Jack Dongarra},
  title = {Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse and Batched Computations},
  booktitle = {Proceedings of the 2020 IEEE/ACM Conference on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems},
  year = {2020}
}
Anzt H, Kuehn E and Flegar G (2020), "Crediting Pull Requests to Open Source Research Software as an Academic Contribution", Journal of Computational Science. , pp. 101278.
Abstract: Like any other scientific discipline, the High Performance Computing community suffers from the publish-or-perish paradigm. As a result, a significant portion of novel algorithm designs and hardware-optimized implementations never make it into production code but are instead abandoned once they have served the purpose of yielding (another) publication. At the same time, community software packages driving scientific research lack the addition of new technology and hardware-specific implementations. This results in a very unsatisfying situation where researchers and software developers are working independently, and traditional peer review is reaching its capacity limits. A paradigm shift that accepts high-quality software pull requests to open source research software as conference contributions may create incentives to realize new and/or improved algorithms in community software ecosystems. In this paper, we propose to complement code reviews on pull requests to scientific open source software with scientific reviews, and to allow the presentation and publication of high quality software contributions that present an academic improvement to the state-of-the-art at scientific conferences.
BibTeX:
@article{Anzt2020e,
  author = {Hartwig Anzt and Eileen Kuehn and Goran Flegar},
  title = {Crediting Pull Requests to Open Source Research Software as an Academic Contribution},
  journal = {Journal of Computational Science},
  year = {2020},
  pages = {101278},
  url = {http://www.sciencedirect.com/science/article/pii/S1877750320305743},
  doi = {10.1016/j.jocs.2020.101278}
}
Archibald R, Chow E, D'Azevedo E, Dongarra J, Eisenbach M, Febbo R, Lopez F, Nichols D, Tomov S, Wong K and Yin J (2020), "Integrating Deep Learning in Domain Sciences at Exascale". Thesis at: University of Tennessee.
Abstract: This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
BibTeX:
@techreport{Archibald2020,
  author = {Rick Archibald and Edmond Chow and Eduardo D'Azevedo and Jack Dongarra and Markus Eisenbach and Rocco Febbo and Florent Lopez and Daniel Nichols and Stanimire Tomov and Kwai Wong and Junqi Yin},
  title = {Integrating Deep Learning in Domain Sciences at Exascale},
  school = {University of Tennessee},
  year = {2020},
  url = {https://www.icl.utk.edu/files/publications/2020/icl-utk-1403-2020.pdf}
}
Asgari B, Hadidi R, Dierberger J, Steinichen C and Kim H (2020), "Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads", November, 2020.
Abstract: Sparse matrices are key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to avoid storing zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in enabling faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only is this challenge unresolved, but it also becomes more serious with the advent of domain-specific architectures (DSAs), as they intend to improve performance more aggressively. The performance implications of using various formats along with DSAs, however, have not been extensively studied by prior work. To fill this gap of knowledge, we characterize the impact of using seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV), implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.
BibTeX:
@article{Asgari2020,
  author = {Bahar Asgari and Ramyad Hadidi and Joshua Dierberger and Charlotte Steinichen and Hyesoon Kim},
  title = {Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads},
  year = {2020}
}
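To make the formats under study concrete: here is CSR built from triplets, with the row-pointer array and the row-wise gather loop that a decompression engine has to implement. A plain MATLAB sketch of my own, sequential of course:
n = 8;
S = sprand(n, n, 0.3);
[i, j, v] = find(S);                     % triplets (MATLAB returns them sorted by column)
[i, p] = sort(i); j = j(p); v = v(p);    % re-sort into row-major order for CSR
rowptr = cumsum([1; accumarray(i, 1, [n 1])]);   % rowptr(r):rowptr(r+1)-1 indexes row r
x = randn(n, 1);
y = zeros(n, 1);
for r = 1:n                              % CSR SpMV: one gather + dot product per row
    idx = rowptr(r):rowptr(r+1)-1;
    y(r) = v(idx)' * x(j(idx));
end
norm(y - S*x)                            % agrees with the built-in SpMV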
Ashcraft C, Buttari A, Mary ThéoAshcraft C, Buttari A and Mary T (2020), "Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR^2 Format". Thesis at: Institut de recherche en informatique de Toulouse (IRIT).
Abstract: We investigate a special class of data sparse rank-structured matrices that combine a flat block low-rank (BLR) partitioning with the use of shared (called nested in the hierarchical case) bases. This format is to ℋ^2 matrices what BLR is to ℋ matrices: we therefore call it the BLR2 matrix format. We present algorithms for the construction and LU factorization of BLR2 matrices, and perform their cost analysis—both asymptotically and for a fixed problem size. With weak admissibility, BLR^2 matrices reduce to block separable matrices (the flat version of HBS/HSS). Our analysis and numerical experiments reveal some limitations of BLR^2 matrices with weak admissibility, which we propose to overcome with two approaches: strong admissibility, and the use of multiple shared bases per row and column.
BibTeX:
@techreport{Ashcraft2020,
  author = {Ashcraft, Cleve and Buttari, Alfredo and Mary, ThéoAshcraft, Cleve and Buttari, Alfredo and Mary, Théo},
  title = {Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR^2 Format},
  school = {Institut de recherche en informatique de Toulouse (IRIT)},
  year = {2020},
  url = {https://hal.archives-ouvertes.fr/hal-03070416}
}
Attouch H, Chbani Z and Riahi H (2020), "Fast Convex Optimization Via a Third-Order In Time Evolution Equation"
Abstract: In a Hilbert space H, we develop fast convex optimization methods, which are based on a third order in time evolution system. The function to minimize f : ℋ → ℝ is convex, continuously differentiable, with argmin f ≠ ∅, and enters the dynamic via its gradient. On the basis of Lyapunov's analysis and temporal scaling techniques, we show a convergence rate of the values of the order 1t^3 , and obtain the convergence of the trajectories towards optimal solutions. When f is strongly convex, an exponential rate of convergence is obtained. We complete the study of the continuous dynamic by introducing a damping term induced by the Hessian of f. This allows the oscillations to be controlled and attenuated. Then, we analyze the convergence of the proximal-based algorithms obtained by temporal discretization of this system, and obtain similar convergence rates. The algorithmic results are valid for a general convex, lower semicontinuous, and proper function f : ℋ → ℝ ∪ +\infty \.
BibTeX:
@article{Attouch2020,
  author = {Hedy Attouch and Zaki Chbani and Hassan Riahi},
  title = {Fast Convex Optimization Via a Third-Order In Time Evolution Equation},
  year = {2020}
}
Awan MG, Deslippe J, Buluc A, Selvitopi O, Hofmeyr S, Oliker L and Yelick K (2020), "ADEPT: a domain independent sequence alignment strategy for gpu architectures", BMC Bioinformatics., 9, 2020. Vol. 21(1) Springer Science and Business Media LLC.
Abstract: Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining the optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic-programming-based method. With the advent of modern sequencing technologies and the increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator-based architectures, a need for an efficient GPU-accelerated strategy has emerged. Existing GPU-based strategies have either been optimized for a specific type of characters (nucleotides or amino acids) or for only a handful of application use-cases.
BibTeX:
@article{Awan2020,
  author = {Muaaz G. Awan and Jack Deslippe and Aydin Buluc and Oguz Selvitopi and Steven Hofmeyr and Leonid Oliker and Katherine Yelick},
  title = {ADEPT: a domain independent sequence alignment strategy for gpu architectures},
  journal = {BMC Bioinformatics},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {21},
  number = {1},
  doi = {10.1186/s12859-020-03720-1}
}
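The dynamic-programming core that ADEPT accelerates is compact. Here is the scoring phase of Smith-Waterman with a linear gap penalty; a plain MATLAB sketch, not ADEPT's GPU kernel (sequences and score parameters are made up), where GPU versions would sweep the table along anti-diagonals, whose cells are independent:
a = 'GGTTGACTA'; b = 'TGTTACGG';
match = 3; mismatch = -3; gap = -2;
H = zeros(numel(a)+1, numel(b)+1);       % DP table, padded with a zero row and column
for i = 2:numel(a)+1
    for j = 2:numel(b)+1
        if a(i-1) == b(j-1), s = match; else, s = mismatch; end
        H(i,j) = max([0, H(i-1,j-1) + s, H(i-1,j) + gap, H(i,j-1) + gap]);
    end
end
fprintf('best local alignment score: %d\n', max(H(:)));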
Awwal AM, Kumam P and Mohammad H (2020), "Iterative algorithm with structured diagonal Hessian approximation for solving nonlinear least squares problems", February, 2020.
Abstract: Nonlinear least-squares problems are a special class of unconstrained optimization problems in which their gradient and Hessian have special structures. In this paper, we exploit these structures and propose a matrix-free algorithm with a diagonal Hessian approximation for solving nonlinear least-squares problems. We devise appropriate safeguarding strategies to ensure the Hessian matrix is positive definite throughout the iteration process. The proposed algorithm generates descent directions and is globally convergent. Preliminary numerical experiments show that the proposed method is competitive with a recently developed similar method.
BibTeX:
@article{Awwal2020,
  author = {Aliyu Muhammed Awwal and Poom Kumam and Hassan Mohammad},
  title = {Iterative algorithm with structured diagonal Hessian approximation for solving nonlinear least squares problems},
  year = {2020}
}
Ayala A, Tomov S, Haidar A and Dongarra J (2020), "heFFTe: Highly Efficient FFT for Exascale", In Lecture Notes in Computer Science. , pp. 262-275. Springer International Publishing.
Abstract: Exascale computing aspires to meet the increasing demands of large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; hence, it is not only important to develop well-designed software, but also to make it aware of the hardware architecture and efficiently exploit its power. Currently, many diverse applications, such as those part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of the heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from CPU state-of-the-art libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V100 GPUs.
BibTeX:
@incollection{Ayala2020,
  author = {Alan Ayala and Stanimire Tomov and Azzam Haidar and Jack Dongarra},
  title = {heFFTe: Highly Efficient FFT for Exascale},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {262--275},
  doi = {10.1007/978-3-030-50371-0_19}
}
Azad A, Buluç A, Li XS, Wang X and Langguth J (2020), "A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs", SIAM Journal on Scientific Computing., 1, 2020. Vol. 42(4), pp. C143-C168. Society for Industrial & Applied Mathematics (SIAM).
Abstract: We design and implement an efficient parallel algorithm for finding a perfect matching in a weighted bipartite graph such that weights on the edges of the matching are large. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization. Due to the lack of scalable alternatives, distributed solvers use sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64. To overcome this limitation, we propose a fully parallel distributed memory algorithm that first generates a perfect matching and then iteratively improves the weight of the perfect matching by searching for weight-increasing cycles of length 4 in parallel. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.
BibTeX:
@article{Azad2020,
  author = {Ariful Azad and Aydin Buluç and Xiaoye S. Li and Xinliang Wang and Johannes Langguth},
  title = {A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs},
  journal = {SIAM Journal on Scientific Computing},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {42},
  number = {4},
  pages = {C143--C168},
  doi = {10.1137/18m1189348}
}
Azad A, Aznaveh MM, Beamer S, Blanco M, Chen J, D'Alessandro L, Dathathri R, Davis T, Deweese K, Firoz J, Gabb HA, Gill G, Hegyi B, Kolodzie S, Low TM, Lumsdaine A, Manlaibaatar T, Mattson TG, McMillan S, Peri R, Pingali K, Sridhar U, Szarnyas G, Zhang Y and Zhang Y (2020), "Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite", In Proceedings of the 2020 IEEE International Symposium on Workload Characterization.
Abstract: Graphs play a key role in data analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways and which graph solution works best on a given platform changes with the structure of the graph. This makes it difficult to decide which graph programming framework is the best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five different frameworks for graph analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection (GKC), and GraphIt. We use the GAP Benchmark Suite to evaluate each framework. GAP consists of 30 tests: six graph algorithms (breadth-first search, single-source shortest path, PageRank, betweenness centrality, connected components, and triangle counting) on five graphs. The GAP Benchmark Suite includes high-performance reference implementations to provide a performance baseline for comparison. Our results show the relative strengths of each framework, but also serve as a case study for the challenges of establishing objective measures for comparing graph frameworks.
BibTeX:
@inproceedings{Azad2020a,
  author = {Ariful Azad and Mohsen Mahmoudi Aznaveh and Scott Beamer and Mark Blanco and Jinhao Chen and Luke D'Alessandro and Roshan Dathathri and Tim Davis and Kevin Deweese and Jesun Firoz and Henry A Gabb and Gurbinder Gill and Balint Hegyi and Scott Kolodzie and Tze Meng Low and Andrew Lumsdaine and Tugsbayasgalan Manlaibaatar and Timothy G Mattson and Scott McMillan and Ramesh Peri and Keshav Pingali and Upasana Sridhar and Gabor Szarnyas and Yunming Zhang and Yongzhe Zhang},
  title = {Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite},
  booktitle = {Proceedings of the 2020 IEEE International Symposium on Workload Characterization},
  year = {2020},
  url = {https://www.cs.utexas.edu/~roshan/GraphAnalyticsFrameworksStudy.pdf}
}
Baayen J and Marecek J (2020), "Mixed-Integer Path-Stable Optimisation, with Applications in Model-Predictive Control of Water Systems", January, 2020.
Abstract: Many systems exhibit a mixture of continuous and discrete dynamics. We consider a family of mixed-integer non-convex non-linear optimisation problems obtained in discretisations of optimal control of such systems. For this family, a branch-and-bound algorithm solves the discretised problem to global optimality. As an example, we consider water systems, where variations in flow and in water levels are continuous, while decisions such as switching fixed-speed pumps and opening or closing gates are discrete. We show that the related optimal-control problems come from the family we introduce -- and implement deterministic solvers with global convergence guarantees.
BibTeX:
@article{Baayen2020,
  author = {Jorn Baayen and Jakub Marecek},
  title = {Mixed-Integer Path-Stable Optimisation, with Applications in Model-Predictive Control of Water Systems},
  year = {2020}
}
Barik R, Minutoli M, Halappanavar M, Tallent NR and Kalyanaraman A (2020), "Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation", In Proceedings of the International Symposium on Workload Characterization., 10, 2020. IEEE.
Abstract: Vertex reordering is a way to improve locality in graph computations. Given an input (or “natural”) order, reordering aims to compute an alternate permutation of the vertices that maximizes a locality-based objective. Given decades of research on this topic, there are tens of graph reordering schemes, and there are also several linear arrangement “gap” measures for treatment as objectives. However, a comprehensive empirical analysis of the efficacy of the ordering schemes against the different gap measures, and against real-world applications, is currently lacking. In this study, we present an extensive empirical evaluation of up to 11 ordering schemes, taken from different classes of approaches, on a set of 34 real-world graphs emerging from different application domains. Our study is presented in two parts: a) a thorough comparative evaluation of the different ordering schemes on their effectiveness at optimizing different linear arrangement gap measures, relevant to preserving locality; and b) an extensive evaluation of the impact of the ordering schemes on two real-world, parallel graph applications, namely, community detection and influence maximization. Our studies show a significant divergence among the ordering schemes (up to 40x between the best and the worst) in their effectiveness at reducing the gap measures, and a wide-ranging impact of the ordering schemes on various aspects including application runtime (up to 4x), memory and cache use, load balancing, and parallel work and efficiency. The comparative study also helps in revealing the nuances of a parallel environment (compared to serial) on the ordering schemes and their role in optimizing applications.
BibTeX:
@inproceedings{Barik2020,
  author = {Reet Barik and Marco Minutoli and Mahantesh Halappanavar and Nathan R. Tallent and Ananth Kalyanaraman},
  title = {Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation},
  booktitle = {Proceedings of the International Symposium on Workload Characterization},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/iiswc50251.2020.00031}
}
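MATLAB ships one of the classic locality-oriented orderings studied in this line of work: reverse Cuthill-McKee, which shrinks the bandwidth of a symmetric pattern. A quick illustration on a built-in graph (I'm not claiming RCM is one of the paper's 11 schemes):
B = bucky;                               % built-in 60-node graph (sparse adjacency matrix)
p = symrcm(B);                           % reverse Cuthill-McKee ordering
[i, j] = find(B);      bwNat = max(abs(i - j));
[i, j] = find(B(p,p)); bwRcm = max(abs(i - j));
fprintf('bandwidth: natural %d, RCM %d\n', bwNat, bwRcm);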
Bauch J and Nadler B (2020), "Rank 2r iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries", February, 2020.
Abstract: We present a new, simple and computationally efficient iterative method for low rank matrix completion. Our method is inspired by the class of factorization-type iterative algorithms, but substantially differs from them in the way the problem is cast. Precisely, given a target rank r, instead of optimizing on the manifold of rank r matrices, we allow our interim estimated matrix to have a specific over-parametrized rank 2r structure. Our algorithm, denoted R2RILS, for rank 2r iterative least squares, thus has low memory requirements, and at each iteration it solves a computationally cheap sparse least-squares problem. We motivate our algorithm by its theoretical analysis for the simplified case of a rank-1 matrix. Empirically, R2RILS is able to recover, with machine precision, ill-conditioned low rank matrices from very few observations -- near the information limit. Finally, R2RILS is stable to corruption of the observed entries by additive zero mean Gaussian noise.
BibTeX:
@article{Bauch2020,
  author = {Jonathan Bauch and Boaz Nadler},
  title = {Rank 2r iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries},
  year = {2020}
}
Beeumen RV, Ibrahim KZ, Kahanamoku-Meyer GD, Yao NY and Yang C (2020), "Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization", December, 2020.
Abstract: In [Van Beeumen et al., HPC Asia 2020, https://www.doi.org/10.1145/3368474.3368497] a scalable and matrix-free eigensolver was proposed for studying the many-body localization (MBL) transition of two-level quantum spin chain models with nearest-neighbor XX+YY interactions plus Z terms. This type of problem is computationally challenging because the vector space dimension grows exponentially with the physical system size, and averaging over different configurations of the random disorder is needed to obtain relevant statistical behavior. For each eigenvalue problem, eigenvalues from different regions of the spectrum and their corresponding eigenvectors need to be computed. Traditionally, the interior eigenstates for a single eigenvalue problem are computed via the shift-and-invert Lanczos algorithm. Due to the extremely high memory footprint of the LU factorizations, this technique is not well suited for a large number of spins L; e.g., one needs thousands of compute nodes on modern high performance computing infrastructures to go beyond L = 24. The matrix-free approach does not suffer from this memory bottleneck; however, its scalability is limited by a computation and communication imbalance. We present a few strategies to reduce this imbalance and to significantly enhance the scalability of the matrix-free eigensolver. To optimize the communication performance, we leverage the consistent space runtime, CSPACER, and show its efficiency in accelerating the MBL irregular communication patterns at scale, compared to optimized MPI non-blocking two-sided and one-sided RMA implementation variants. The efficiency and effectiveness of the proposed algorithm is demonstrated by computing eigenstates on a massively parallel many-core high performance computer.
BibTeX:
@article{Beeumen2020,
  author = {Roel Van Beeumen and Khaled Z. Ibrahim and Gregory D. Kahanamoku-Meyer and Norman Y. Yao and Chao Yang},
  title = {Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization},
  year = {2020}
}
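Matrix-free here means the eigensolver only ever calls a function that applies the operator to a vector. MATLAB's eigs supports exactly this interface; a generic sketch with a 1-D Laplacian stand-in (nothing like the spin-chain Hamiltonian, and assuming a recent MATLAB with the IsFunctionSymmetric option):
n = 4096;
Afun = @(x) 2*x - [x(2:end); 0] - [0; x(1:end-1)];   % applies the 1-D Laplacian; A is never formed
d = eigs(Afun, n, 4, 'largestabs', 'IsFunctionSymmetric', true);
disp(d')                                 % four eigenvalues just below 4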
Bellavia S and Gurioli G (2020), "Complexity Analysis of a Stochastic Cubic Regularisation Method under Inexact Gradient Evaluations and Dynamic Hessian Accuracy", January, 2020.
Abstract: We here adapt an extended version of the adaptive cubic regularisation method with dynamic inexact Hessian information for nonconvex optimisation in [2] to the stochastic optimisation setting. While exact function evaluations are still considered, this novel variant inherits the innovative use of adaptive accuracy requirements for Hessian approximations introduced in [2] and additionally employs inexact computations of the gradient. Without restrictions on the variance of the errors, we assume that these approximations are available within a sufficiently large, but fixed, probability and we extend, in the spirit of [13], the deterministic analysis of the framework to its stochastic counterpart, showing that the expected number of iterations to reach a first-order stationary point matches the well known worst-case optimal complexity. This is, in fact, still given by O(ε^(-3/2)) with respect to the first-order ε tolerance.
BibTeX:
@article{Bellavia2020,
  author = {Stefania Bellavia and Gianmarco Gurioli},
  title = {Complexity Analysis of a Stochastic Cubic Regularisation Method under Inexact Gradient Evaluations and Dynamic Hessian Accuracy},
  year = {2020}
}
Bemporad A and Cimini G (2020), "Reduction of the Number of Variables in Parametric Constrained Least-Squares Problems", December, 2020.
Abstract: For linearly constrained least-squares problems that depend on a vector of parameters, this paper proposes techniques for reducing the number of involved optimization variables. After first eliminating equality constraints in a numerically robust way by QR factorization, we propose a technique based on singular value decomposition (SVD) and unsupervised learning, that we call K-SVD, and neural classifiers to automatically partition the set of parameter vectors in K nonlinear regions in which the original problem is approximated by using a smaller set of variables. For the special case of parametric constrained least-squares problems that arise from model predictive control (MPC) formulations, we propose a novel and very efficient QR factorization method for equality constraint elimination. Together with SVD or K-SVD, the method provides a numerically robust alternative to standard condensing and move blocking, and to other complexity reduction methods for MPC based on basis functions. We show the good performance of the proposed techniques in numerical tests and in a linearized MPC problem of a nonlinear benchmark process.
BibTeX:
@article{Bemporad2020,
  author = {Alberto Bemporad and Gionata Cimini},
  title = {Reduction of the Number of Variables in Parametric Constrained Least-Squares Problems},
  year = {2020}
}
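The first step, eliminating A x = b by a QR factorization of A' and optimizing only in the null-space coordinates, is standard and worth seeing once. A small sketch on a toy problem of my own, not the paper's MPC-specific factorization:
rng(0);
A = randn(3, 8); b = randn(3, 1);        % equality constraints A*x = b (full row rank)
C = randn(20, 8); d = randn(20, 1);      % least-squares objective ||C*x - d||
[Q, R] = qr(A');                         % A' = Q*R with R = [R1; 0]
Q1 = Q(:, 1:3); Q2 = Q(:, 4:end);        % range and null space of the constraint rows
x0 = Q1 * (R(1:3, 1:3)' \ b);            % particular solution: A*x0 = b
z  = (C*Q2) \ (d - C*x0);                % unconstrained LS in the 5 remaining variables
x  = x0 + Q2*z;
norm(A*x - b)                            % ~ machine precision: constraints hold exactly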
Bereznyi D, Qutbuddin A, Her Y and Yang K (2020), "Node-attributed Spatial Graph Partitioning", In Proceedings of the 28th International Conference on Advances in Geographic Information Systems.
Abstract: Given a spatial graph and a set of node attributes, the Node-attributed Spatial Graph Partitioning (NSGP) problem partitions a node-attributed spatial graph into k homogeneous sub-graphs that minimize both the total RMSErank1 and edge-cuts while meeting a size constraint on the sub-graphs. RMSErank1 is the Root Mean Square Error between a matrix and its rank-one decomposition. The NSGP problem is important for many societal applications such as identifying homogeneous communities in a spatial graph and detecting interrelated patterns in traffic accidents. This problem is NP-hard; it is computationally challenging because of the large size of spatial graphs and the constraint that the sub-graphs must be homogeneous, i.e. similar in terms of node attributes. This paper proposes a novel approach for finding a set of homogeneous sub-graphs that can minimize both the total RMSErank1 and edge-cuts while meeting the size constraint. Experiments and a case study using U.S. Census datasets and HP#6 watershed network datasets demonstrate that the proposed approach partitions a spatial graph into a set of homogeneous sub-graphs and reduces the computational cost.
BibTeX:
@inproceedings{Bereznyi2020,
  author = {Daniel Bereznyi and Ahmad Qutbuddin and YoungGu Her and KwangSoo Yang},
  title = {Node-attributed Spatial Graph Partitioning},
  booktitle = {Proceedings of the 28th International Conference on Advances in Geographic Information Systems},
  year = {2020},
  doi = {10.1145/3397536.3422198}
}
Bergamaschi L, Marin J and Martinez A (2020), "Compact Quasi-Newton preconditioners for SPD linear systems", January, 2020.
Abstract: In this paper preconditioners for the Conjugate Gradient method are studied to solve the Newton system with symmetric positive definite Jacobian. In particular, we define a sequence of preconditioners built by means of SR1 and BFGS low-rank updates. We develop conditions under which the SR1 update maintains the preconditioner SPD. Spectral analysis of the SR1 preconditioned Jacobians shows an improved eigenvalue distribution as the Newton iteration proceeds. A compact matrix formulation of the preconditioner update is developed which reduces the cost of its application and is more suitable for parallel implementation. Some notes on the implementation of the corresponding Inexact Newton method are given and numerical results on a number of model problems illustrate the efficiency of the proposed preconditioners.
BibTeX:
@article{Bergamaschi2020,
  author = {Luca Bergamaschi and Jose Marin and Angeles Martinez},
  title = {Compact Quasi-Newton preconditioners for SPD linear systems},
  year = {2020}
}
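The SR1 update of an approximate inverse fits on one line. This sketch applies a single pair (s, y = A s) to a Jacobi preconditioner and keeps the update only when the usual denominator safeguard holds; the SPD-preservation conditions the paper derives are more delicate than this:
n = 200;
R = sprandn(n, n, 0.05); A = R'*R + speye(n);        % SPD test matrix
dA = full(diag(A));
P = spdiags(1./dA, 0, n, n);                         % initial Jacobi approximation of inv(A)
s = randn(n, 1); y = A*s;                            % a (step, gradient-difference) pair
u = s - P*y;
if abs(u'*y) > 1e-8 * norm(u) * norm(y)              % standard SR1 denominator safeguard
    applyP = @(r) P*r + u * ((u'*r) / (u'*y));       % P_new = P + u*u'/(u'*y), applied matrix-free
end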
Bergamaschi L (2020), "A Survey of Low-Rank Updates of Preconditioners for Sequences of Symmetric Linear Systems", Algorithms., 4, 2020. Vol. 13(4), pp. 100. MDPI AG.
Abstract: The aim of this survey is to review some recent developments in devising efficient preconditioners for sequences of symmetric positive definite (SPD) linear systems A_k x_k = b_k, k = 1, 2, …, arising in many scientific applications, such as discretization of transient Partial Differential Equations (PDEs), solution of eigenvalue problems, (Inexact) Newton methods applied to nonlinear systems, and rational Krylov methods for computing a function of a matrix. In this paper, we will analyze a number of techniques for updating a given initial preconditioner by a low-rank matrix with the aim of improving the clustering of eigenvalues around 1, in order to speed up the convergence of the Preconditioned Conjugate Gradient (PCG) method. We will also review some techniques to efficiently approximate the linearly independent vectors which constitute the low-rank corrections, and whose choice is crucial for the effectiveness of the approach. Numerical results on real-life applications show that the performance of a given iterative solver can be very much enhanced by the use of low-rank updates.
BibTeX:
@article{Bergamaschi2020a,
  author = {Luca Bergamaschi},
  title = {A Survey of Low-Rank Updates of Preconditioners for Sequences of Symmetric Linear Systems},
  journal = {Algorithms},
  publisher = {MDPI AG},
  year = {2020},
  volume = {13},
  number = {4},
  pages = {100},
  doi = {10.3390/a13040100}
}
Berger GO, Absil PA, Jungers RM and Nesterov Y (2020), "On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient", January, 2020.
Abstract: We show that Hölder continuity of the gradient is not only a sufficient condition, but also a necessary condition for the existence of a global upper bound on the error of the first-order Taylor approximation. We also relate this global upper bound to the Hölder constant of the gradient. This relation is expressed as an interval, depending on the Hölder constant, in which the error of the first-order Taylor approximation is guaranteed to be. We show that, for the Lipschitz continuous case, the interval cannot be reduced. An application to the norms of quadratic forms is proposed, which allows us to derive a novel characterization of Euclidean norms.
BibTeX:
@article{Berger2020,
  author = {Guillaume O. Berger and P. -A. Absil and Raphaël M. Jungers and Yurii Nesterov},
  title = {On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient},
  year = {2020}
}
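The bound is easy to probe numerically. A small MATLAB check on a one-dimensional function with a ν-Hölder gradient, estimating the Hölder constant by sampling (my illustration of the stated interval, not the paper's construction):
% Check |f(y) - f(x) - f'(x)(y-x)| <= M/(1+nu) * |y-x|^(1+nu)
nu = 0.5;
f = @(x) abs(x).^(1 + nu) / (1 + nu);
g = @(x) sign(x) .* abs(x).^nu;          % nu-Holder continuous gradient
xs = linspace(-2, 2, 401);
[X, Y] = meshgrid(xs, xs);
D = abs(X - Y);
ratios = abs(g(X) - g(Y)) ./ max(D, eps).^nu;
M = max(ratios(:));                      % sampled Holder constant of g
err = abs(f(Y) - f(X) - g(X) .* (Y - X));
bound = M / (1 + nu) * D.^(1 + nu);
fprintf('max violation: %g\n', max(err(:) - bound(:)));   % <= 0 up to roundoff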
Bergou E, Diouane Y and Kungurtsev V (2020), "Convergence and Complexity Analysis of a Levenberg-Marquardt Algorithm for Inverse Problems", Journal of Optimization Theory and Applications., April, 2020.
Abstract: The Levenberg-Marquardt algorithm is one of the most popular algorithms for finding the solution of nonlinear least squares problems. Across different modified variations of the basic procedure, the algorithm enjoys global convergence, a competitive worst case iteration complexity rate, and a guaranteed rate of local convergence for both zero and nonzero small residual problems, under suitable assumptions. We introduce a novel Levenberg-Marquardt method that matches, simultaneously, the state of the art in all of these convergence properties with a single seamless algorithm. Numerical experiments confirm the theoretical behavior of our proposed algorithm.
BibTeX:
@article{Bergou2020,
  author = {E. Bergou and Y. Diouane and V. Kungurtsev},
  title = {Convergence and Complexity Analysis of a Levenberg-Marquardt Algorithm for Inverse Problems},
  journal = {Journal of Optimization Theory and Applications},
  year = {2020}
}
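For reference, the basic Levenberg-Marquardt loop that such papers refine fits in a few lines of MATLAB; this is the generic textbook variant with a crude damping update, not the authors' algorithm:
% Levenberg-Marquardt for min_x 0.5*||r(x)||^2 on the Himmelblau system.
r = @(x) [x(1)^2 + x(2) - 11; x(1) + x(2)^2 - 7];
J = @(x) [2*x(1), 1; 1, 2*x(2)];
x = [0; 0]; lambda = 1e-2;
for k = 1:100
    rk = r(x); Jk = J(x);
    d = -(Jk' * Jk + lambda * eye(2)) \ (Jk' * rk);   % damped Gauss-Newton step
    if norm(r(x + d)) < norm(rk)
        x = x + d; lambda = lambda / 3;               % accept, trust more
    else
        lambda = lambda * 3;                          % reject, damp harder
    end
    if norm(Jk' * rk) < 1e-10, break; end
end
disp(x')   % converges to a root of the system, e.g. (3, 2)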
Bernaschi M, D'Ambra P and Pasquini D (2020), "BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs", Software Impacts., November, 2020., pp. 100041. Elsevier BV.
Abstract: Sparse solvers are one of the building blocks of any technology for reliable and high-performance scientific and engineering computing. In this paper we present a software package which implements an efficient multigrid sparse solver running on Graphics Processing Units. The package is a branch of a wider initiative of software development for sparse Linear Algebra computations on emergent HPC architectures involving a large research group working in many application projects over the last ten years.
BibTeX:
@article{Bernaschi2020,
  author = {Massimo Bernaschi and Pasqua D'Ambra and Dario Pasquini},
  title = {BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs},
  journal = {Software Impacts},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {100041},
  doi = {10.1016/j.simpa.2020.100041}
}
Bertsimas D, Cory-Wright R and Pauphilet J (2020), "Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality", May, 2020.
Abstract: Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than p=100s covariates. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting k=10s covariates from p=300 variables, and provides small bound gaps at a larger scale. We also propose two convex relaxations and randomized rounding schemes that provide certifiably near-exact solutions within minutes for p=100s or hours for p=1,000s. Using real-world financial and medical datasets, we illustrate our approach's ability to derive interpretable principal components tractably at scale.
BibTeX:
@article{Bertsimas2020,
  author = {Dimitris Bertsimas and Ryan Cory-Wright and Jean Pauphilet},
  title = {Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality},
  year = {2020}
}
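The certifiable cutting-plane machinery is well beyond a snippet, but the problem itself is easy to state; a truncated power iteration (a classical heuristic baseline, not the paper's method) shows what "selecting k of p covariates" means:
% Truncated power iteration for sparse PCA: keep the k largest entries.
rng(0); p = 50; k = 5;
Sigma = cov(randn(200, p));              % sample covariance
x = randn(p, 1); x = x / norm(x);
for it = 1:200
    x = Sigma * x;
    [~, idx] = maxk(abs(x), k);          % support of the k largest entries
    mask = false(p, 1); mask(idx) = true;
    x(~mask) = 0;
    x = x / norm(x);
end
fprintf('explained variance %g with %d nonzeros\n', x' * Sigma * x, nnz(x));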
Besta M, Carigiet A, Vonarburg-Shmaria Z, Janda K, Gianinazzi L and Hoefler T (2020), "High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality", In Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis., November, 2020.
Abstract: We develop the first parallel graph coloring heuristics with strong theoretical guarantees on work and depth and coloring quality. The key idea is to design a relaxation of the vertex degeneracy order, a well-known graph theory concept, and to color vertices in the order dictated by this relaxation. This introduces a tunable amount of parallelism into the degeneracy ordering that is otherwise hard to parallelize. This simple idea enables significant benefits in several key aspects of graph coloring. For example, one of our algorithms ensures polylogarithmic depth and a bound on the number of used colors that is superior to all other parallelizable schemes, while maintaining work-efficiency. In addition to provable guarantees, the developed algorithms have competitive run-times for several real-world graphs, while almost always providing superior coloring quality. Our degeneracy ordering relaxation is of separate interest for algorithms outside the context of coloring.
BibTeX:
@inproceedings{Besta2020,
  author = {Maciej Besta and Armon Carigiet and Zur Vonarburg-Shmaria and Kacper Janda and Lukas Gianinazzi and Torsten Hoefler},
  title = {High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality},
  booktitle = {Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis},
  year = {2020}
}
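The degeneracy (smallest-last) order that the paper relaxes is easy to compute sequentially; a MATLAB sketch of greedy coloring in that order on a small built-in graph:
% Greedy coloring in smallest-last (degeneracy) order.
A = bucky; n = size(A, 1);               % 60-vertex test graph
deg = full(sum(A, 2)); alive = true(n, 1);
order = zeros(n, 1);
for i = n:-1:1                           % peel off a minimum-degree vertex
    cand = find(alive);
    [~, j] = min(deg(cand)); v = cand(j);
    order(i) = v; alive(v) = false;
    nbr = A(:, v) > 0;
    deg(nbr) = deg(nbr) - 1;
end
color = zeros(n, 1);
for i = 1:n                              % color vertices in degeneracy order
    v = order(i);
    used = color(A(:, v) > 0);
    c = 1;
    while any(used == c), c = c + 1; end
    color(v) = c;
end
fprintf('colors used: %d\n', max(color));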
Bian H, Huang J, Liu L, Huang D and Wang X (2020), "ALBUS: A method for efficiently processing SpMV using SIMD and Load balancing", Future Generation Computer Systems., November, 2020. Elsevier BV.
Abstract: SpMV (Sparse matrix-vector multiplication) is widely used in many fields, and improving its performance has been the pursuit of many researchers. Parallel SpMV on multi-core processors has become the standard parallel approach. In reality, the non-zero elements of many sparse matrices are not evenly distributed, so parallelism without preprocessing causes a large performance loss due to uneven load. In this paper, we propose ALBUS (Absolute Load Balancing Using SIMD (Single Instruction Multiple Data)), a method for efficiently processing SpMV using load balancing and SIMD vectorization. On the one hand, ALBUS achieves balanced load processing across cores; on the other hand, it fully exploits the CPU's SIMD vectorization parallelism. We selected 20 sets of regular matrices and 20 sets of irregular matrices to form the benchmark suite. We performed SpMV performance comparison tests on ALBUS, CSR5 (Compressed Sparse Row 5), Merge (merge-based SpMV), and MKL (Math Kernel Library) under the same conditions. On the E5-2670 v3 CPU platform, for the 20 sets of regular matrices, ALBUS achieves an average speedup of 1.59x, 1.32x, 1.48x (up to 2.53x, 2.22x, 2.31x) compared to CSR5, Merge, and MKL, respectively. For the 20 sets of irregular matrices, ALBUS achieves an average speedup of 1.38x, 1.42x, 2.44x (up to 2.33x, 2.24x, 5.37x) compared to CSR5, Merge, and MKL, respectively.
BibTeX:
@article{Bian2020,
  author = {Haodong Bian and Jianqiang Huang and Lingbin Liu and Dongqiang Huang and Xiaoying Wang},
  title = {ALBUS: A method for efficiently processing SpMV using SIMD and Load balancing},
  journal = {Future Generation Computer Systems},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.future.2020.10.036}
}
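The load-balancing half of ALBUS amounts to cutting the CSR row range so that each thread owns roughly the same number of nonzeros; the SIMD half lives below MATLAB's level of abstraction. A sketch, with the thread count T as an assumption:
% Split rows of a sparse matrix into T chunks with balanced nonzeros.
rng(0);
A = sprand(10000, 10000, 1e-3);
T = 8;
rownnz = full(sum(A ~= 0, 2));
cum = cumsum(rownnz);
bounds = zeros(T + 1, 1); bounds(end) = size(A, 1);
for t = 1:T-1
    [~, bounds(t + 1)] = min(abs(cum - t * cum(end) / T));  % nearest row boundary
end
chunknnz = diff([0; cum(bounds(2:end))]);
fprintf('nnz per chunk: min %d, max %d\n', min(chunknnz), max(chunknnz));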
Blanco MP, McMillan S and Low TM (2020), "Towards an Objective Metric for the Performance of Exact Triangle Count", September, 2020.
Abstract: The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.
BibTeX:
@article{Blanco2020,
  author = {Mark P. Blanco and Scott McMillan and Tze Meng Low},
  title = {Towards an Objective Metric for the Performance of Exact Triangle Count},
  year = {2020}
}
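Their point is easy to appreciate with a toy computation: two graphs with the same edge count can require very different amounts of intersection work. A MATLAB sketch of the exact count together with a simple wedge-based work proxy (my choice of proxy, not the paper's exact metric):
% Exact triangle count and a wedge count as a crude work estimate.
rng(0); n = 200;
A = sprand(n, n, 0.05);
A = spones(triu(A, 1)); A = A + A';      % simple undirected graph
tri = full(sum(sum(A .* (A * A)))) / 6;  % each triangle counted 6 times
deg = full(sum(A, 2));
wedges = sum(deg .* (deg - 1)) / 2;      % length-2 paths the merges examine
fprintf('%d edges, %d triangles, %d wedges\n', nnz(A) / 2, tri, wedges);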
Bogle I, Boman EG, Devine K and Slota GM (2020), "Distributed Memory Graph Coloring Algorithms for Multiple GPUs"
Abstract: Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but hybrid MPI+GPU algorithms have been unexplored until this work, to the best of our knowledge. We present several MPI+GPU coloring approaches that use implementations of the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to solve for distance-2 coloring, giving the first known distributed and multi-GPU algorithm for this problem. In addition, we propose novel methods to reduce communication in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.
BibTeX:
@article{Bogle2020,
  author = {Ian Bogle and Erik G. Boman and Karen Devine and George M. Slota},
  title = {Distributed Memory Graph Coloring Algorithms for Multiple GPUs},
  year = {2020},
  url = {http://www.cs.rpi.edu/~slotag/pub/Coloring-IA320.pdf}
}
Boley D (2020), "On Fast Computation of Directed Graph Laplacian Pseudo-Inverse", February, 2020.
Abstract: The Laplacian matrix and its pseudo-inverse for a strongly connected directed graph is fundamental in computing many properties of a directed graph. Examples include random-walk centrality and betweenness measures, average hitting and commute times, and other connectivity measures. These measures arise in the analysis of many social and computer networks. In this short paper, we show how a linear system involving the Laplacian may be solved in time linear in the number of edges, times a factor depending on the separability of the graph. This leads directly to the column-by-column computation of the entire Laplacian pseudo-inverse in time quadratic in the number of nodes, i.e., constant time per matrix entry. The approach is based on "off-the-shelf" iterative methods for which global linear convergence is guaranteed, without recourse to any matrix elimination algorithm.
BibTeX:
@article{Boley2020,
  author = {Daniel Boley},
  title = {On Fast Computation of Directed Graph Laplacian Pseudo-Inverse},
  year = {2020}
}
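For the undirected case the column-by-column idea can be written down directly via the identity pinv(L) = inv(L + J/n) - J/n for a connected graph, with J the all-ones matrix; the directed case in the paper requires the guaranteed-convergent iterative solvers instead. A MATLAB sketch:
% Column-by-column pseudo-inverse of an undirected graph Laplacian.
A = double(bucky); n = size(A, 1);
L = diag(full(sum(A, 2))) - A;           % Laplacian, null space = span(1)
M = L + ones(n) / n;                     % nonsingular shift
Lpinv = zeros(n);
for j = 1:n                              % one linear solve per column
    e = zeros(n, 1); e(j) = 1;
    Lpinv(:, j) = M \ e - 1 / n;
end
fprintf('||L*pinv(L)*L - L|| = %g\n', norm(L * Lpinv * L - L, 'fro'));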
Bolte J and Pauwels E (2020), "Curiosities and counterexamples in smooth convex optimization", January, 2020.
Abstract: Counterexamples to some old-standing optimization problems in the smooth convex coercive setting are provided. We show that block-coordinate, steepest descent with exact search or Bregman descent methods do not generally converge. Other failures of various desirable features are established: directional convergence of Cauchy's gradient curves, convergence of Newton's flow, finite length of Tikhonov path, convergence of central paths, or smooth Kurdyka-Lojasiewicz inequality. All examples are planar. These examples are based on general smooth convex interpolation results. Given a decreasing sequence of positively curved C^k convex compact sets in the plane, we provide a level set interpolation of a C^k smooth convex function where k ≥ 2 is arbitrary. If the intersection is reduced to one point, our interpolant has positive definite Hessian; otherwise it is positive definite outside the solution set. Furthermore, given a sequence of decreasing polygons we provide an interpolant agreeing with the vertices and whose gradients coincide with prescribed normals.
BibTeX:
@article{Bolte2020,
  author = {Jerome Bolte and Edouard Pauwels},
  title = {Curiosities and counterexamples in smooth convex optimization},
  year = {2020}
}
Bonifati A, Dumbrava S and Kondylakis H (2020), "Graph Summarization", April, 2020.
Abstract: The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. The focus of our chapter is to pinpoint the main graph summarization methods, but especially to focus on the most recent approaches and novel research trends on this topic, not yet covered by previous surveys.
BibTeX:
@article{Bonifati2020,
  author = {Angela Bonifati and Stefania Dumbrava and Haridimos Kondylakis},
  title = {Graph Summarization},
  year = {2020}
}
Booth JD and Bolet G (2020), "An On-Node Scalable Sparse Incomplete LU Factorization for a Many-Core Iterative Solver with Javelin", Parallel Computing., March, 2020., pp. 102622. Elsevier BV.
Abstract: We present a scalable incomplete LU factorization to be used as a preconditioner for solving sparse linear systems with iterative methods in the package called Javelin. Javelin allows for improved parallel factorization on shared-memory many-core systems by packaging the coefficient matrix into a format that allows for high performance sparse matrix-vector multiplication and sparse triangular solves with minimal overheads. The framework achieves these goals by using a collection of traditional permutations, point-to-point thread synchronizations, tasking, and segmented prefix scans in a conventional compressed sparse row (CSR) format. Moreover, this framework stresses the importance of co-designing dependent tasks, such as sparse factorization and triangular solves, on highly-threaded architectures. We compare our method to past distributed methods for incomplete factorization (Aztec) and current multithreaded packages (WSMP) in order to demonstrate the importance of having highly threaded factorizations on many-core systems. With these changes, traditional fill-in and drop tolerance methods can still be used, while achieving observed speedups of up to ~42x on 68 Intel Knights Landing cores and ~12x on 14 Intel Haswell cores. Moreover, this work provides insight into how the new data structure impacts iteration counts, and points to future improvements, such as extensions to GPUs.
BibTeX:
@article{Booth2020,
  author = {Joshua Dennis Booth and Gregory Bolet},
  title = {An On-Node Scalable Sparse Incomplete LU Factorization for a Many-Core Iterative Solver with Javelin},
  journal = {Parallel Computing},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {102622},
  doi = {10.1016/j.parco.2020.102622}
}
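MATLAB's built-in ilu exposes the same fill-in and drop-tolerance knobs the abstract mentions; a minimal example of an ILU-preconditioned GMRES solve, illustrating the setting rather than Javelin itself:
% Incomplete LU with threshold dropping, used as a GMRES preconditioner.
A = gallery('wathen', 20, 20);           % sparse test matrix
b = ones(size(A, 1), 1);
setup = struct('type', 'ilutp', 'droptol', 1e-3);
[Lf, Uf] = ilu(A, setup);
[x, flag, relres, iter] = gmres(A, b, 30, 1e-10, 20, Lf, Uf);
fprintf('flag %d, relative residual %g\n', flag, relres);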
Booth JD (2020), "Auto Adaptive Irregular OpenMP Loops", July, 2020.
Abstract: OpenMP is a standard for parallelization, popular for the ease of programming parallel-for loops in a fork-join manner. Many shared-memory applications are implemented using this model, despite it not being ideal for applications with high load imbalance, such as those that make irregular memory accesses. One parameter, the chunk size, is made available to users in order to mitigate the performance loss. However, this parameter depends on the architecture, system load, application, and input, making it difficult to tune. We present an OpenMP scheduler that adaptively tunes the chunk size for unbalanced applications that make irregular memory accesses. In particular, this method (iCh) uses work-stealing for imbalance and adapts the chunk size using a force-feedback model that approximates the variance of task lengths within a chunk. This scheduler has low overhead and allows for active load balancing while the application is running. We demonstrate this using both sparse matrix-vector multiplication (spmv) and betweenness centrality (bc), and show that iCh achieves average speedups close to (i.e., within 1.061x for spmv and 1.092x for bc) those of OpenMP loops scheduled either dynamically or with work-stealing methods whose chunk size was tuned offline.
BibTeX:
@article{Booth2020a,
  author = {Joshua Dennis Booth},
  title = {Auto Adaptive Irregular OpenMP Loops},
  year = {2020}
}
Bosilca G, Harrison R, Herault T, Javanmard M, Nookala P and Valeev E (2020), "The Template Task Graph (TTG) -- an emerging practical dataflow programming paradigm for scientific simulation at extreme scale", In Proceedings of the 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware.
Abstract: We describe TESSE, an emerging general-purpose, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. TESSE builds upon and extends the PaRSEC DAG/dataflow runtime with new Domain Specific Languages (DSLs) and new integration capabilities. Motivating this work is our belief that such a dataflow model, perhaps with applications composed in domain specific languages, can overcome many of the challenges faced by a wide variety of irregular applications that are poorly served by current programming and execution models. Two such applications from many-body physics and applied mathematics are briefly explored. This paper focuses upon the Template Task Graph (TTG), which is TESSE's main C++ API that provides a powerful work/data-flow programming model. Algorithms on spatial trees, block-sparse tensors, and wave fronts are used to illustrate the API and associated concepts, as well as to compare with related approaches.
BibTeX:
@inproceedings{Bosilca2020,
  author = {G. Bosilca and R.J. Harrison and T. Herault and M.M. Javanmard and P. Nookala and E.F. Valeev},
  title = {The Template Task Graph (TTG) -- an emerging practical dataflow programming paradigm for scientific simulation at extreme scale},
  booktitle = {Proceedings of the 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware},
  year = {2020}
}
Botoeva E, Kouvaros P, Kronqvist J, Lomuscio A and Misener R (2020), "Efficient Verification of ReLU-based Neural Networks via Dependency Analysis", In Proceedings of the 34th AAAI Conference on Artificial Intelligence.
Abstract: We introduce an efficient method for the verification of ReLU-based feed-forward neural networks. We derive an automated procedure that exploits dependency relations between the ReLU nodes, thereby pruning the search tree that needs to be considered by MILP-based formulations of the verification problem. We augment the resulting algorithm with methods for input domain splitting and symbolic interval propagation. We present Venus, the resulting verification toolkit, and evaluate it on the ACAS collision avoidance networks and models trained on the MNIST and CIFAR-10 datasets. The experimental results obtained indicate considerable gains over the present state-of-the-art tools.
BibTeX:
@inproceedings{Botoeva2020,
  author = {Elena Botoeva and Panagiotis Kouvaros and Jan Kronqvist and Alessio Lomuscio and Ruth Misener},
  title = {Efficient Verification of ReLU-based Neural Networks via Dependency Analysis},
  booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
  year = {2020}
}
Boukaram W, Lucchesi M, Turkiyyah G, Le Maître O, Knio O and Keyes D (2020), "Hierarchical matrix approximations for space-fractional diffusion equations", Computer Methods in Applied Mechanics and Engineering., September, 2020. Vol. 369, pp. 113191. Elsevier BV.
Abstract: Space fractional diffusion models generally lead to dense discrete matrix operators, which lead to substantial computational challenges when the system size becomes large. For a state of size N, full representation of a fractional diffusion matrix would require O(N^2) memory storage requirement, with a similar estimate for matrix–vector products. In this work, we present ℋ^2 matrix representation and algorithms that are amenable to efficient implementation on GPUs, and that can reduce the cost of storing these operators to O(N) asymptotically. Matrix–vector multiplications can be performed in asymptotically linear time as well. Performance of the algorithms is assessed in light of 2D simulations of space fractional diffusion equation with constant diffusivity. Attention is focused on smooth particle approximation of the governing equations, which lead to discrete operators involving explicit radial kernels. The algorithms are first tested using the fundamental solution of the unforced space fractional diffusion equation in an unbounded domain, and then for the steady, forced, fractional diffusion equation in a bounded domain. Both matrix-inverse and pseudo-transient solution approaches are considered in the latter case. Our experiments show that the construction of the fractional diffusion matrix, the matrix–vector multiplication, and the generation of an approximate inverse pre-conditioner all perform very well on a single GPU on 2D problems with N in the range 10^5 -- 10^6. In addition, the tests also showed that, for the entire range of parameters and fractional orders considered, results obtained using the ℋ^2 approximations were in close agreement with results obtained using dense operators, and exhibited the same spatial order of convergence. Overall, the present experiences showed that the ℋ^2 matrix framework promises to provide practical means to handle large-scale space fractional diffusion models in several space dimensions, at a computational cost that is asymptotically similar to the cost of handling classical diffusion equations.
BibTeX:
@article{Boukaram2020,
  author = {Wajih Boukaram and Marco Lucchesi and George Turkiyyah and Olivier Le Maître and Omar Knio and David Keyes},
  title = {Hierarchical matrix approximations for space-fractional diffusion equations},
  journal = {Computer Methods in Applied Mechanics and Engineering},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {369},
  pages = {113191},
  doi = {10.1016/j.cma.2020.113191}
}
Bramas B and Ketterlin A (2020), "Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering", PeerJ Computer Science., January, 2020. Vol. 6, pp. e247. PeerJ.
Abstract: The task-based approach is a parallelization paradigm in which an algorithm is transformed into a direct acyclic graph of tasks: the vertices are computational elements extracted from the original algorithm and the edges are dependencies between those. During the execution, the management of the dependencies adds an overhead that can become significant when the computational cost of the tasks is low. A possibility to reduce the makespan is to aggregate the tasks to make them heavier, while having fewer of them, with the objective of mitigating the importance of the overhead. In this paper, we study an existing clustering/partitioning strategy to speed up the parallel execution of a task-based application. We provide two additional heuristics to this algorithm and perform an in-depth study on a large graph set. In addition, we propose a new model to estimate the execution duration and use it to choose the proper granularity. We show that this strategy allows speeding up a real numerical application by a factor of 7 on a multi-core system.
BibTeX:
@article{Bramas2020,
  author = {Bérenger Bramas and Alain Ketterlin},
  title = {Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering},
  journal = {PeerJ Computer Science},
  publisher = {PeerJ},
  year = {2020},
  volume = {6},
  pages = {e247},
  doi = {10.7717/peerj-cs.247}
}
Brayford D, Bernau C, Hesse W and Guillen C (2020), "Analyzing Performance Properties Collected by the PerSyst Scalable HPC Monitoring Tool", September, 2020.
Abstract: The ability to understand how a scientific application is executed on a large HPC system is of great importance in allocating resources within the HPC data center. In this paper, we describe how we used system performance data to identify: execution patterns, possible code optimizations and improvements to the system monitoring. We also identify candidates for employing machine learning techniques to predict the performance of similar scientific codes.
BibTeX:
@article{Brayford2020,
  author = {David Brayford and Christoph Bernau and Wolfram Hesse and Carla Guillen},
  title = {Analyzing Performance Properties Collected by the PerSyst Scalable HPC Monitoring Tool},
  year = {2020}
}
Brock B, Buluç A, Mattson TG, McMillan S and Moreira JE (2020), "A Roadmap for the GraphBLAS C++ API". Thesis at: U.C. Berkeley.
Abstract: The GraphBLAS is an API for graph algorithms expressed in terms of linear algebra. The current GraphBLAS specification is for the C programming language. Implementations of the GraphBLAS exposed a number of limitations due to C that restrict both the expressiveness and the performance of the GraphBLAS. The C++ language's first-class support for generics, including template metaprogramming, addresses these limitations, yielding a simpler GraphBLAS API that should deliver better performance, especially for methods based on user-defined types and operators. When combined with the pervasiveness of C++ across many domains as well as within large-scale distributed codes, we see a compelling argument to define a GraphBLAS C++ API. This paper presents a roadmap for the development of a GraphBLAS C++ API with a focus on the open issues we must resolve before completing the specification. Our goal is to foster discussion within the GraphBLAS user community and receive feedback on the directions we are taking with the GraphBLAS C++ API.
BibTeX:
@techreport{Brock2020,
  author = {Benjamin Brock and Aydın Buluç and Timothy G. Mattson and Scott McMillan and Jose E. Moreira},
  title = {A Roadmap for the GraphBLAS C++ API},
  school = {U.C. Berkeley},
  year = {2020}
}
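The GraphBLAS premise, graph algorithms as linear algebra over semirings, can be mimicked in a few lines of MATLAB: breadth-first search becomes repeated sparse matrix-vector products with a mask over visited vertices:
% BFS as SpMV: one matrix-vector product per frontier expansion.
A = double(bucky); n = size(A, 1);
level = zeros(n, 1);
frontier = zeros(n, 1); frontier(1) = 1;   % start from vertex 1
for d = 1:n
    level(frontier > 0) = d;
    next = A * frontier;                   % expand the frontier
    frontier = double(next > 0 & level == 0);  % mask out visited vertices
    if ~any(frontier), break; end
end
fprintf('max BFS depth from vertex 1: %d\n', max(level) - 1);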
Brock B, Buluç A, Mattson TG, McMillan S, Moreira JE, Pearce R, Selvitopi O and Steil T (2020), "Considerations for a Distributed GraphBLAS API", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops., May, 2020. IEEE.
Abstract: The GraphBLAS emerged from an international effort to standardize linear-algebraic building blocks for computing on graphs and graph-structured data. The GraphBLAS is expressed as a C API and has paved the way for multiple implementations. The GraphBLAS C API, however, does not define how distributed-memory parallelism should be handled. This paper reviews various approaches for a GraphBLAS API for distributed computing. This work is guided by our experience with existing distributed memory libraries. Our goal for this paper is to highlight the pros and cons of different approaches rather than to advocate for one particular choice.
BibTeX:
@inproceedings{Brock2020a,
  author = {Benjamin Brock and Aydın Buluç and Timothy G. Mattson and Scott McMillan and Jose E. Moreira and Roger Pearce and Oguz Selvitopi and Trevor Steil},
  title = {Considerations for a Distributed GraphBLAS API},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ipdpsw50202.2020.00048}
}
Brown C, Abdelfattah A, Tomov S and Dongarra J (2020), "Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs". Thesis at: University of Tennessee.
Abstract: Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of many different types of applications, and because DLA routines have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations is used to identify current limitations and directions for future improvements.
BibTeX:
@techreport{Brown2020,
  author = {Cade Brown and Ahmad Abdelfattah and Stanimire Tomov and Jack Dongarra},
  title = {Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs},
  school = {University of Tennessee},
  year = {2020},
  url = {https://www.icl.utk.edu/files/publications/2020/icl-utk-1405-2020.pdf}
}
Brunie H, Iancu C, Ibrahim K, Brisk P and Cook B (2020), "Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 694-707. IEEE Computer Society.
Abstract: We present a methodology for precision tuning of full applications. These techniques must select a search space composed of either variables or instructions and provide a scalable search strategy. In full application settings one cannot assume compiler support, for practical reasons. Thus, an additional important challenge is enabling code refactoring. We argue for an instruction-based search space and we show: 1) how to exploit dynamic program information based on call stacks; and 2) how to exploit the iterative nature of scientific codes, combined with temporal locality. We applied the methodology to tune the implementation of scientific codes written in a combination of Python, CUDA, C++ and Fortran, tuning calls to math exp library functions. The iterative search refinement always reduces the search complexity and the number of steps to solution. Dynamic program information increases search efficacy. Using this approach, we obtain application runtime performance improvements of up to 27%.
BibTeX:
@inproceedings{Brunie2020,
  author = {H. Brunie and C. Iancu and K. Ibrahim and P. Brisk and B. Cook},
  title = {Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {694--707},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00054},
  doi = {10.1109/SC41405.2020.00054}
}
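Stripped of the call-stack machinery, the core test behind precision tuning is simple: run a candidate call site in reduced precision and accept the demotion only if the error against a high-precision reference stays within a user tolerance. A MATLAB sketch of that test, an illustration of the idea rather than the paper's tooling:
% Decide whether an exp call site can be demoted from double to single.
x = linspace(-20, 20, 1e6)';
ref = exp(x);                            % double-precision reference
lo = double(exp(single(x)));             % demoted evaluation
relerr = max(abs(lo - ref) ./ max(abs(ref), realmin));
tol = 1e-6;                              % assumed application tolerance
if relerr <= tol
    fprintf('demote to single: relative error %g\n', relerr);
else
    fprintf('keep double: relative error %g\n', relerr);
end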
Brust JJ, Leyffer S and Petra CG (2020), "Compact Representations of Structured BFGS Matrices"
Abstract: For general large-scale optimization problems, compact representations exist in which recursive quasi-Newton update formulas are represented as compact matrix factorizations. For problems in which the objective function contains additional structure, so-called structured quasi-Newton methods exploit available second-derivative information and approximate unavailable second derivatives. This article develops the compact representations of two structured Broyden-Fletcher-Goldfarb-Shanno update formulas. The compact representations enable efficient limited memory and initialization strategies. Two limited memory line search algorithms are described and tested on a collection of problems.
BibTeX:
@article{Brust2020,
  author = {J. J. Brust and S. Leyffer and C. G. Petra},
  title = {Compact Representations of Structured BFGS Matrices},
  year = {2020}
}
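For the unstructured case, the compact representation of Byrd, Nocedal and Schnabel that this work builds on can be checked in a few lines of MATLAB: m rank-two BFGS updates collapse into B0 plus a correction built from two n-by-m blocks:
% Compact BFGS representation versus m recursive rank-two updates.
rng(1); n = 8; m = 3;
R = randn(n); Aspd = R' * R + n * eye(n);   % SPD map guarantees s'y > 0
S = randn(n, m); Y = Aspd * S;
B0 = eye(n);
SY = S' * Y;
D = diag(diag(SY));                      % diag(s_i' y_i)
Lmat = tril(SY, -1);                     % strictly lower triangle of S'Y
W = [B0 * S, Y];
Mmid = [S' * B0 * S, Lmat; Lmat', -D];
B = B0 - W * (Mmid \ W');                % compact form
Bref = B0;                               % recursive reference
for i = 1:m
    s = S(:, i); y = Y(:, i);
    Bref = Bref - (Bref * s) * (s' * Bref) / (s' * (Bref * s)) ...
                + (y * y') / (y' * s);
end
fprintf('difference: %g\n', norm(B - Bref, 'fro'));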
Bueler E (2020), "PETSc for Partial Differential Equations: Numerical Solutions in C and Python", January, 2020. Society for Industrial and Applied Mathematics.
Abstract: The Portable, Extensible Toolkit for Scientific Computation (PETSc) is an open-source library of advanced data structures and methods for solving linear and nonlinear equations and for managing discretizations. This book uses these modern numerical tools to demonstrate how to solve nonlinear partial differential equations (PDEs) in parallel. It starts from key mathematical concepts, such as Krylov space methods, preconditioning, multigrid, and Newton's method. In PETSc these components are composed at run time into fast solvers. Discretizations are introduced from the beginning, with an emphasis on finite difference and finite element methodologies. The example C programs of the first 12 chapters, listed on the inside front cover, solve (mostly) elliptic and parabolic PDE problems. Discretization leads to large, sparse, and generally nonlinear systems of algebraic equations. For such problems, mathematical solver concepts are explained and illustrated through the examples, with sufficient context to speed further development.
PETSc for Partial Differential Equations:
- addresses both discretization and fast solvers for PDEs;
- emphasizes practice more than theory;
- contains well-structured examples, with advice on run-time solver choices;
- demonstrates how to achieve high performance and parallel scalability; and
- builds on the reader's understanding of fast solver concepts when applying the Firedrake Python finite element solver library in the last two chapters.
This textbook, the first to cover PETSc programming for nonlinear PDEs, provides an on-ramp for graduate students and researchers to a major area of high-performance computing for science and engineering. It is suitable as a supplement for courses in scientific computing or numerical methods for differential equations.
BibTeX:
@book{Bueler2020,
  author = {Ed Bueler},
  title = {PETSc for Partial Differential Equations: Numerical Solutions in C and Python},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  doi = {10.1137/1.9781611976311}
}
Bullins B and Lai KA (2020), "Higher-order methods for convex-concave min-max optimization and monotone variational inequalities", July, 2020.
Abstract: We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the p^th-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of O(1/T^((p+1)/2)) when given access to an oracle for finding a fixed point of a p^th-order equation. We give analogous rates for the weak monotone variational inequality problem. For p>2, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski [2004] and the second-order method of Monteiro and Svaiter [2012]. We further instantiate our entire algorithm in the unconstrained p=2 case.
BibTeX:
@article{Bullins2020,
  author = {Brian Bullins and Kevin A. Lai},
  title = {Higher-order methods for convex-concave min-max optimization and monotone variational inequalities},
  year = {2020}
}
Burke JV, Curtis FE, Wang H and Wang J (2020), "Inexact Sequential Quadratic Optimization with Penalty Parameter Updates within the QP Solver", SIAM Journal on Optimization., January, 2020. Vol. 30(3), pp. 1822-1849. Society for Industrial & Applied Mathematics (SIAM).
Abstract: This paper focuses on the design of sequential quadratic optimization (commonly known as SQP) methods for solving large-scale nonlinear optimization problems. The most computationally demanding aspect of such an approach is the computation of the search direction during each iteration, for which we consider the use of matrix-free methods. In particular, we develop a method that requires an inexact solve of a single QP subproblem to establish the convergence of the overall SQP method. It is known that SQP methods can be plagued by poor behavior of the global convergence mechanism. To confront this issue, we propose the use of an exact penalty function with a dynamic penalty parameter updating strategy to be employed within the subproblem solver in such a way that the resulting search direction predicts progress toward both feasibility and optimality. We present our parameter updating strategy and prove that, under reasonable assumptions, the strategy does not modify the penalty parameter unnecessarily. We close the paper with a discussion of the results of numerical experiments that illustrate the benefits of our proposed techniques.
BibTeX:
@article{Burke2020,
  author = {James V. Burke and Frank E. Curtis and Hao Wang and Jiashan Wang},
  title = {Inexact Sequential Quadratic Optimization with Penalty Parameter Updates within the QP Solver},
  journal = {SIAM Journal on Optimization},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {30},
  number = {3},
  pages = {1822--1849},
  doi = {10.1137/18m1176488}
}
Buttari A, Huber M, Leleux P, Mary T, Rüde U and Wohlmuth B (2020), "Block Low Rank Single Precision Coarse Grid Solvers for Extreme Scale Multigrid Methods". Thesis at: Institut de recherche en informatique de Toulouse (IRIT).
Abstract: Extreme scale simulation requires fast and scalable algorithms, such as multigrid methods. To achieve asymptotically optimal complexity it is essential to employ a hierarchy of grids. The cost to solve the coarsest grid system can often be neglected in sequential computations, but cannot be ignored in massively parallel executions. In this case, the coarsest grid can be large and its efficient solution becomes a challenging task. We propose solving the coarse grid system using modern, approximate sparse direct methods and investigate the expected gains compared with traditional iterative methods. Since the coarse grid system only requires an approximate solution, we show that we can leverage block low-rank techniques, combined with the use of single precision arithmetic, to significantly reduce the computational requirements of the direct solver. In the case of extreme scale computing, the coarse grid system is too large for a sequential solution, but too small to permit massively parallel efficiency. We show that the agglomeration of the coarse grid system to a subset of processors is necessary for the sparse direct solver to achieve performance. We demonstrate the efficiency of the proposed method on a Stokes-type saddle point system. We employ a monolithic Uzawa multigrid method. In particular, we show that the use of an approximate sparse direct solver for the coarse grid system can outperform that of a preconditioned minimal residual iterative method. This is demonstrated for the multigrid solution of systems of order up to 10^11 degrees of freedom on a petascale supercomputer using 43 200 processes.
BibTeX:
@techreport{Buttari2020,
  author = {Alfredo Buttari and Markus Huber and Philippe Leleux and Theo Mary and Ulrich Rüde and Barbara Wohlmuth},
  title = {Block Low Rank Single Precision Coarse Grid Solvers for Extreme Scale Multigrid Methods},
  school = {Institut de recherche en informatique de Toulouse (IRIT)},
  year = {2020},
  url = {https://hal.archives-ouvertes.fr/hal-02528532}
}
Cai Y and Li P (2020), "Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations", March, 2020.
Abstract: We consider the problem of robust matrix completion, which aims to recover a low rank matrix L_* and a sparse matrix S_* from incomplete observations of their sum M = L_* + S_* ∈ ℝ^(m×n). Algorithmically, the robust matrix completion problem is transformed into a problem of solving a system of nonlinear equations, and the alternating direction method is then used to solve the nonlinear equations. In addition, the algorithm is highly parallelizable and suitable for large scale problems. Theoretically, we characterize sufficient conditions under which L_* can be approximated by a low rank approximation of the observed M. Under proper assumptions, it is shown that the algorithm converges to the true solution linearly. Numerical simulations show that the simple method works as expected and is comparable with state-of-the-art methods.
BibTeX:
@article{Cai2020,
  author = {Yunfeng Cai and Ping Li},
  title = {Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations},
  year = {2020}
}
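To get a feel for the L + S model (though not for the paper's nonlinear-equations solver), here is a crude alternating heuristic in MATLAB: rank-r SVD truncation for the low-rank part, soft-thresholding for the sparse part, using the observed entries only. The rank r and threshold tau are assumed known:
% Alternating heuristic for M = L + S observed on a mask Omega.
rng(0); m = 60; n = 50; r = 3;
Ltrue = randn(m, r) * randn(r, n);
Strue = full(sparse(randi(m, 50, 1), randi(n, 50, 1), 5 * randn(50, 1), m, n));
Omega = rand(m, n) < 0.7;                % observed entries
M = (Ltrue + Strue) .* Omega;
L = zeros(m, n); S = zeros(m, n); tau = 0.5;
for it = 1:100
    [U, Sig, V] = svd((M - S) .* Omega + L .* ~Omega, 'econ');
    L = U(:, 1:r) * Sig(1:r, 1:r) * V(:, 1:r)';       % rank-r truncation
    Rres = (M - L) .* Omega;
    S = sign(Rres) .* max(abs(Rres) - tau, 0);        % soft-threshold
end
fprintf('relative error in L: %g\n', norm(L - Ltrue, 'fro') / norm(Ltrue, 'fro'));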
Calandra H, Gratton S, Riccietti E and Vasseur X (2020), "On a multilevel Levenberg–Marquardt method for the training of artificial neural networks and its application to the solution of partial differential equations", Optimization Methods and Software., June, 2020., pp. 1-26. Informa UK Limited.
Abstract: In this paper, we propose a new multilevel Levenberg–Marquardt optimizer for the training of artificial neural networks with quadratic loss function. This setting allows us to get further insight into the potential of multilevel optimization methods. Indeed, when the least squares problem arises from the training of artificial neural networks, the variables subject to optimization are not related by any geometrical constraints and the standard interpolation and restriction operators cannot be employed any longer. A heuristic, inspired by algebraic multigrid methods, is then proposed to construct the multilevel transfer operators. We test the new optimizer on an important application: the approximate solution of partial differential equations by means of artificial neural networks. The learning problem is formulated as a least squares problem, choosing the nonlinear residual of the equation as a loss function, whereas the multilevel method is employed as a training method. Numerical experiments show encouraging results related to the efficiency of the new multilevel optimization method compared to the corresponding one-level procedure in this context.
BibTeX:
@article{Calandra2020,
  author = {H. Calandra and S. Gratton and E. Riccietti and X. Vasseur},
  title = {On a multilevel Levenberg–Marquardt method for the training of artificial neural networks and its application to the solution of partial differential equations},
  journal = {Optimization Methods and Software},
  publisher = {Informa UK Limited},
  year = {2020},
  pages = {1--26},
  doi = {10.1080/10556788.2020.1775828}
}
Campos JS, Misener R and Parpas P (2020), "Partial Lasserre relaxation for sparse Max-Cut"
Abstract: A common approach to solve or find bounds of polynomial optimization problems like Max-Cut is to use the first level of the Lasserre hierarchy. Higher levels of the Lasserre hierarchy provide tighter bounds, but solving these relaxations is usually computationally intractable. We propose to strengthen the first level relaxation for sparse Max-Cut problems using constraints from the second order Lasserre hierarchy. We explore a variety of approaches for adding a subset of the positive semidefinite constraints of the second order sparse relaxation obtained by using the maximum cliques of the graph's chordal extension. We apply this idea to sparse graphs of different sizes and densities, and provide evidence of its strengths and limitations when compared to the state-of-the-art Max-Cut solver BiqCrunch and the alternative sparse relaxation CS-TSSOS.
BibTeX:
@article{Campos2020,
  author = {Juan S. Campos and Ruth Misener and Panos Parpas},
  title = {Partial Lasserre relaxation for sparse Max-Cut},
  year = {2020}
}
Cao X and Liu KJR (2020), "Distributed Newton's Method for Network Cost Minimization", IEEE Transactions on Automatic Control., pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: In this work, we examine a novel generic network cost minimization problem, in which every node has a local decision vector to optimize. Each node incurs a cost associated with its decision vector while each link incurs a cost related to the decision vectors of its two end nodes. All nodes collaborate to minimize the overall network cost. The formulated network cost minimization problem has broad applications in distributed signal processing and control, in which the notion of link costs often arises. To solve this problem in a decentralized manner, we develop a distributed variant of the Newton's method, which possesses faster convergence than alternative first order optimization methods such as gradient descent and alternating direction method of multipliers. The proposed method is based on an appropriate splitting of the Hessian matrix and an approximation of its inverse, which is used to determine the Newton step. Global linear convergence of the proposed algorithm is established under several standard technical assumptions on the local cost functions. Furthermore, analogous to classical centralized Newton's method, a quadratic convergence phase of the algorithm over a certain time interval is identified. Finally, numerical simulations are conducted to validate the effectiveness of the proposed algorithm and its superiority over other first order methods, especially when the cost functions are ill-conditioned. Complexity issues of the proposed distributed Newton's method and alternative first order methods are also discussed.
BibTeX:
@article{Cao2020,
  author = {Xuanyu Cao and K. J. Ray Liu},
  title = {Distributed Newton's Method for Network Cost Minimization},
  journal = {IEEE Transactions on Automatic Control},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--1},
  doi = {10.1109/tac.2020.2989266}
}
Cao Q, Pei Y, Akbudak K, Bosilca G, Ltaief H, Keyes DE and Dongarra J (2020), "Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems", In Proceedings of the 35th IEEE International Parallel & Distributed Processing Symposium.
Abstract: The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity, typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide a dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorizations by up to 7-fold on a large-scale distributed-memory system, while minimizing the memory footprint up to a 44-fold factor. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.
BibTeX:
@inproceedings{Cao2020a,
  author = {Cao, Qinglei and Pei, Yu and Akbudak, Kadir and Bosilca, George and Ltaief, Hatem and Keyes, David E. and Dongarra, Jack},
  title = {Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems},
  booktitle = {Proceedings of the 35th IEEE International Parallel & Distributed Processing Symposium},
  year = {2020}
}
Carmon Y and Duchi JC (2020), "First-Order Methods for Nonconvex Quadratic Minimization", March, 2020.
Abstract: We consider minimization of indefinite quadratics with either trust-region (norm) constraints or cubic regularization. Despite the nonconvexity of these problems we prove that, under mild assumptions, gradient descent converges to their global solutions, and give a non-asymptotic rate of convergence for the cubic variant. We also consider Krylov subspace solutions and establish sharp convergence guarantees to the solutions of both trust-region and cubic-regularized problems. Our rates mirror the behavior of these methods on convex quadratics and eigenvector problems, highlighting their scalability. When we use Krylov subspace solutions to approximate the cubic-regularized Newton step, our results recover the strongest known convergence guarantees to approximate second-order stationary points of general smooth nonconvex functions.
BibTeX:
@article{Carmon2020,
  author = {Yair Carmon and John C. Duchi},
  title = {First-Order Methods for Nonconvex Quadratic Minimization},
  year = {2020}
}
Carratalá-Sáez R, Faverge M, Pichon G, Sylvand G and Quintana-Ortí ES (2020), "Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers". Thesis at: INRIA.
Abstract: In this paper, we describe and evaluate an extension of the Chameleon library to operate with hierarchical matrices (H-Matrices) and hierarchical arithmetic (H-Arithmetic), producing efficient solvers for linear systems arising in Boundary Element Methods (BEM). Our approach builds upon an open-source H-Matrices library from Airbus, named Hmat-oss, that collects sequential numerical kernels for both hierarchical and low-rank structures; the tiled algorithms and task-parallel decompositions available in Chameleon for the solution of linear systems; and the StarPU runtime system to orchestrate an efficient task-parallel (multi-threaded) execution on a multicore architecture. Using an application producing matrices with features close to real industrial applications, we present shared-memory results that demonstrate a fair level of performance, close to (and sometimes better than) that offered by a pure H-Matrix approach, as implemented in Airbus's proprietary (non-open-source) Hmat library. Hence, the combination Chameleon + Hmat-oss constitutes the most efficient fully open-source software stack for solving dense compressible linear systems on shared-memory architectures (distributed-memory support is under development).
BibTeX:
@techreport{Carratala2020,
  author = {Rocío Carratalá-Sáez and Mathieu Faverge and Grégoire Pichon and Guillaume Sylvand and Enrique S. Quintana-Ortí},
  title = {Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers},
  school = {INRIA},
  year = {2020},
  url = {https://hal.inria.fr/hal-02489269}
}
Carson E and Strakoš Z (2020), "On the cost of iterative computations", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., January, 2020. Vol. 378(2166), pp. 20190050. The Royal Society.
Abstract: With exascale-level computation on the horizon, the art of predicting the cost of computations has acquired a renewed focus. This task is especially challenging in the case of iterative methods, for which convergence behaviour often cannot be determined with certainty a priori (unless we are satisfied with potentially outrageous overestimates) and which typically suffer from performance bottlenecks at scale due to synchronization cost. Moreover, the amplification of rounding errors can substantially affect the practical performance, in particular for methods with short recurrences. In this article, we focus on what we consider to be key points which are crucial to understanding the cost of iteratively solving linear algebraic systems. This naturally leads us to questions on the place of numerical analysis in relation to mathematics, computer science and sciences, in general.
BibTeX:
@article{Carson2020,
  author = {Carson, Erin and Strakoš, Zdeněk},
  title = {On the cost of iterative computations},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190050},
  doi = {10.1098/rsta.2019.0050}
}
Carson E, Higham NJ and Pranesh S (2020), "Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems"
Abstract: The standard iterative refinement procedure for improving an approximate solution to the least squares problem min_x ||b - Ax||_2, where A ∈ ℝ^(m×n) with m ≥ n has full rank, is based on solving the (m + n) × (m + n) augmented system with the aid of a QR factorization. In order to exploit multiprecision arithmetic, iterative refinement can be formulated to use three precisions, but the resulting algorithm converges only for a limited range of problems. We build an iterative refinement algorithm called GMRES-LSIR, analogous to the GMRES-IR algorithm developed for linear systems [SIAM J. Sci. Comput., 40 (2019), pp. A817-A847], that solves the augmented system using GMRES preconditioned by a matrix based on the computed QR factors. We explore two left preconditioners; the first has full off-diagonal blocks and the second is block diagonal and can be applied in either left-sided or split form. We prove that for a wide range of problems the first preconditioner yields backward and forward errors for the augmented system of order the working precision under suitable assumptions on the precisions and the problem conditioning. Our proof does not extend to the block diagonal preconditioner, but our numerical experiments show that with this preconditioner the algorithm performs about as well in practice.
BibTeX:
@article{Carson2020a,
  author = {Carson, Erin and Higham, Nicholas J. and Pranesh, Srikara},
  title = {Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2745/1/paper.pdf}
}
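To make the augmented-system formulation concrete, here is a minimal MATLAB sketch of iterative refinement on the (m + n) × (m + n) system, with each correction computed by MATLAB's gmres. This is a uniform-precision illustration only, not the authors' three-precision GMRES-LSIR, and their QR-based preconditioners are omitted.
rng('default');
m = 100; n = 20;
A = randn(m, n); b = randn(m, 1);
[~, R] = qr(A, 0);                      % economy QR factor of A
x = R \ (R' \ (A' * b));                % initial solve via semi-normal equations
r = b - A * x;                          % least squares residual
K = [eye(m), A; A', zeros(n)];          % augmented (saddle-point) matrix
for it = 1:3
    f = [b; zeros(n, 1)] - K * [r; x];  % residual of the augmented system
    d = gmres(K, f, [], 1e-12, m + n);  % correction (unpreconditioned here)
    r = r + d(1:m); x = x + d(m+1:end);
end
norm(A' * (b - A * x))                  % normal-equations residual, ideally tiny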
Carson E and Gergelits T (2020), "Quarter I Report: Initial Exploration of the Use of Mixed Precision in Iterative Solvers". Technical report at: LLNL-Charles University.
Abstract: The first quarter of the project was primarily spent identifying potential projects at the intersection of finite precision analysis, mixed precision computation, and Krylov subspace methods. We summarize our findings in the remainder of the document. Other activities include attending biweekly xSDK meetings as well as contributing material to the technical report and journal versions of the multiprecision landscape paper. The subsequent quarter will be spent selecting a subset of the described projects to focus on, performing initial numerical experiments to evaluate the potential for the use of mixed precision, and developing initial theoretical analysis.
BibTeX:
@techreport{Carson2020b,
  author = {Erin Carson and Tomáš Gergelits},
  title = {Quarter I Report: Initial Exploration of the Use of Mixed Precision in Iterative Solvers},
  institution = {LLNL-Charles University},
  year = {2020}
}
Carson E, Lund K, Rozložník M and Thomas S (2020), "An overview of block Gram-Schmidt methods and their stability properties", October, 2020.
Abstract: Block Gram-Schmidt algorithms comprise essential kernels in many scientific computing applications, but for many commonly used variants, a rigorous treatment of their stability properties remains open. This survey provides a comprehensive categorization of block Gram-Schmidt algorithms, especially those used in Krylov subspace methods to build orthonormal bases one block vector at a time. All known stability results are assembled, and new results are summarized or conjectured for important communication-reducing variants. A diverse array of numerical illustrations is presented, along with MATLAB code for reproducing the results, available in a public repository at https://github.com/katlund/BlockStab. A number of open problems are discussed, and an appendix containing all algorithms typeset in a uniform fashion is provided.
BibTeX:
@article{Carson2020c,
  author = {Erin Carson and Kathryn Lund and Miroslav Rozložník and Stephen Thomas},
  title = {An overview of block Gram-Schmidt methods and their stability properties},
  year = {2020}
}
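As a concrete instance of the algorithms being surveyed, here is a minimal MATLAB sketch of block classical Gram-Schmidt (BCGS), orthonormalizing one block of columns at a time; the sizes are arbitrary, the intra-block step uses economy QR, and the communication-reducing and reorthogonalized variants studied in the paper are not shown.
rng('default');
m = 500; p = 5; s = 4;                       % p blocks of s columns each
X = randn(m, p * s);
Q = zeros(m, p * s); R = zeros(p * s);
for k = 1:p
    cols = (k - 1) * s + 1 : k * s;
    prev = 1 : (k - 1) * s;
    W = X(:, cols);
    R(prev, cols) = Q(:, prev)' * W;         % coefficients against earlier blocks
    W = W - Q(:, prev) * R(prev, cols);      % block projection step
    [Q(:, cols), R(cols, cols)] = qr(W, 0);  % intra-block orthogonalization
end
norm(eye(p * s) - Q' * Q)                    % loss of orthogonality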
Cartis C, Gould N and Toint PL (2020), "Strong Evaluation Complexity Bounds for Arbitrary-Order Optimization of Nonconvex Nonsmooth Composite Functions", January, 2020.
Abstract: We introduce the concept of strong high-order approximate minimizers for nonconvex optimization problems. These apply in both standard smooth and composite non-smooth settings, and additionally allow convex or inexpensive constraints. An adaptive regularization algorithm is then proposed to find such approximate minimizers. Under suitable Lipschitz continuity assumptions, whenever the feasible set is convex, it is shown that using a model of degree p, this algorithm will find a strong approximate q-th-order minimizer in at most 𝒪(max_{1≤j≤q} ε_j^{-(p+1)/(p-j+1)}) evaluations of the problem's functions and their derivatives, where ε_j is the j-th-order accuracy tolerance; this bound applies when either q=1 or the problem is not composite with q ≤ 2. For general non-composite problems, even when the feasible set is nonconvex, the bound becomes 𝒪(max_{1≤j≤q} ε_j^{-q(p+1)/p}) evaluations. If the problem is composite, and either q > 1 or the feasible set is not convex, the bound is then 𝒪(max_{1≤j≤q} ε_j^{-(q+1)}) evaluations. These results not only provide, to our knowledge, the first known bound for (unconstrained or inexpensively-constrained) composite problems for optimality orders exceeding one, but also give the first sharp bounds for high-order strong approximate q-th order minimizers of standard (unconstrained and inexpensively constrained) smooth problems, thereby complementing known results for weak minimizers.
BibTeX:
@article{Cartis2020,
  author = {Coralia Cartis and Nick Gould and Philippe L. Toint},
  title = {Strong Evaluation Complexity Bounds for Arbitrary-Order Optimization of Nonconvex Nonsmooth Composite Functions},
  year = {2020}
}
Cartis C, Gould NIM and Toint PL (2020), "Strong Evaluation Complexity of An Inexact Trust-Region Algorithm for Arbitrary-Order Unconstrained Nonconvex Optimization", November, 2020.
Abstract: A trust-region algorithm using inexact function and derivative values is introduced for solving unconstrained smooth optimization problems. This algorithm uses high-order Taylor models and allows the search of strong approximate minimizers of arbitrary order. The evaluation complexity of finding a q-th approximate minimizer using this algorithm is then shown, under standard conditions, to be 𝒪(max_{j∈{1,…,q}} ε_j^{-(q+1)}), where the ε_j are the order-dependent requested accuracy thresholds. Remarkably, this order is identical to that of classical trust-region methods using exact information.
BibTeX:
@article{Cartis2020a,
  author = {C. Cartis and N. I. M. Gould and Ph. L. Toint},
  title = {Strong Evaluation Complexity of An Inexact Trust-Region Algorithm for Arbitrary-Order Unconstrained Nonconvex Optimization},
  year = {2020}
}
Champion C, Blazère M, Burcelin R, Loubes J-M and Risser L (2020), "Robust spectral clustering using LASSO regularization", April, 2020.
Abstract: Cluster structure detection is a fundamental task for the analysis of graphs, in order to understand and to visualize their functional characteristics. Among the different cluster structure detection methods, spectral clustering is currently one of the most widely used due to its speed and simplicity. Yet, there are few theoretical guarantees for recovering the underlying partitions of the graph for general models. This paper therefore presents a variant of spectral clustering, called ℓ1-spectral clustering, performed on a new random model closely related to the stochastic block model. Its goal is to promote a sparse eigenbasis solution of an ℓ1 minimization problem revealing the natural structure of the graph. The effectiveness and the robustness to small noise perturbations of our technique are confirmed through a collection of simulated and real data examples.
BibTeX:
@article{Champion2020,
  author = {Camille Champion and Mélanie Blazère and Rémy Burcelin and Jean-Michel Loubes and Laurent Risser},
  title = {Robust spectral clustering using LASSO regularization},
  year = {2020}
}
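For context, here is a minimal MATLAB sketch of the standard spectral clustering pipeline that the ℓ1 variant modifies, on a small synthetic two-block graph; the block-model parameters are made up for illustration, degrees are assumed nonzero, and kmeans requires the Statistics and Machine Learning Toolbox.
rng('default');
k = 2; n = 60;
blocks = kron(eye(k), ones(n / k));               % two planted communities
P = 0.3 * blocks + 0.02 * (1 - blocks);           % edge probabilities
W = double(triu(rand(n) < P, 1)); W = W + W';     % symmetric adjacency matrix
d = sum(W, 2);
L = eye(n) - diag(1 ./ sqrt(d)) * W * diag(1 ./ sqrt(d));  % normalized Laplacian
[V, ~] = eigs(L, k, 'smallestabs');               % k leading eigenvectors
labels = kmeans(V, k);                            % cluster the spectral embedding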
Chang TH (2020), "Mathematical Software for Multiobjective Optimization Problems". Thesis at: Virginia Polytechnic Institute and State University.
Abstract: In this thesis, two distinct problems in data-driven computational science are considered. The main problem of interest is the multiobjective optimization problem, where the tradeoff surface (called the Pareto front) between multiple conflicting objectives must be approximated in order to identify designs that balance real-world tradeoffs. In order to solve multiobjective optimization problems that are derived from computationally expensive blackbox functions, such as engineering design optimization problems, several methodologies are combined, including surrogate modeling, trust region methods, and adaptive weighting. The result is a numerical software package that finds approximately Pareto optimal solutions that are evenly distributed across the Pareto front, using minimal cost function evaluations. The second problem of interest is the closely related problem of multivariate interpolation, where an unknown response surface representing an underlying phenomenon is approximated by finding a function that exactly matches available data. To solve the interpolation problem, a novel algorithm is proposed for computing only a sparse subset of the elements in the Delaunay triangulation, as needed to compute the Delaunay interpolant. For high-dimensional data, this reduces the time and space complexity of Delaunay interpolation from exponential time to polynomial time in practice. For each of the above problems, both serial and parallel implementations are described. Additionally, both solutions are demonstrated on real-world problems in computer system performance modeling.
BibTeX:
@phdthesis{Chang2020,
  author = {Tyler H. Chang},
  title = {Mathematical Software for Multiobjective Optimization Problems},
  school = {Virginia Polytechnic Institute and State University},
  year = {2020}
}
Chatzidimitriou A and Gizopoulos D (2020), "rACE: Reverse-Order Processor Reliability Analysis", In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition., March, 2020. , pp. 1115-1120.
Abstract: Modern microprocessors suffer from increased error rates that come along with fabrication technology scaling. Processor designs continuously become more prone to hardware faults that lead to execution errors and system failures, raising the need for protection mechanisms. However, error mitigation strategies have to be applied diligently, as they impose significant power, area, and performance overheads. Early and accurate reliability estimation of a microprocessor design is essential in order to determine the most vulnerable hardware structures and the most efficient protection schemes. One of the most commonly used techniques for reliability estimation is Architecturally Correct Execution (ACE) analysis. ACE analysis can be applied at different abstraction models, including microarchitecture and RTL, and often requires a single or few simulations to report the Architectural Vulnerability Factor (AVF) of the processor structures. However, ACE analysis overestimates the vulnerability of structures because of its pessimistic, worst-case nature. Moreover, it only delivers coarse-grain vulnerability reports and no details about the expected result of hardware faults (silent data corruptions, crashes). In this paper, we present reverse ACE (rACE), a methodology that (a) improves the accuracy of ACE analysis and (b) delivers fine-grain error outcome reports. Using a reverse-order tracing flow, rACE analysis associates portions of the simulated execution of a program with the actual output and the control flow, delivering finer accuracy and results classification. Our findings show that rACE reports an average 1.45× overestimation, compared to Statistical Fault Injection, for different sizes of the register file of an out-of-order CPU core (executing both ARM and x86 binaries), when a baseline ACE analysis reports 2.3× overestimation and even refined versions of ACE analysis report an average of 1.8× overestimation.
BibTeX:
@inproceedings{Chatzidimitriou2020,
  author = {A. Chatzidimitriou and D. Gizopoulos},
  title = {rACE: Reverse-Order Processor Reliability Analysis},
  booktitle = {Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition},
  year = {2020},
  pages = {1115--1120},
  doi = {10.23919/DATE48585.2020.9116355}
}
Chen Y, Wang S, Zheng F and Cen Y (2020), "Graph-regularized least squares regression for multi-view subspace clustering", Knowledge-Based Systems., January, 2020. , pp. 105482. Elsevier BV.
Abstract: Many works have proven that the consistency and differences in multi-view subspace clustering make the clustering results better than those of single-view clustering. Therefore, this paper studies the multi-view clustering problem, which aims to divide data points into several groups using multiple features. However, existing multi-view clustering methods fail to capture the grouping effect and local geometrical structure of the multiple features. In order to solve these problems, this paper proposes a novel multi-view subspace clustering model called graph-regularized least squares regression (GLSR), which uses the least squares regression instead of the nuclear norm to generate a grouping effect, as well as a manifold constraint to preserve the local geometrical structure of multiple features. Specifically, the proposed GLSR method adopts the least squares regression to learn the globally consensus information shared by multiple views and the column-sparsity norm to measure the residual information. Under the alternating direction method of multipliers framework, an effective method is developed by iteratively updating all variables. Numerical studies on eight real databases demonstrate the effectiveness and superior performance of the proposed GLSR over eleven state-of-the-art methods.
BibTeX:
@article{Chen2020,
  author = {Yongyong Chen and Shuqin Wang and Fangying Zheng and Yigang Cen},
  title = {Graph-regularized least squares regression for multi-view subspace clustering},
  journal = {Knowledge-Based Systems},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {105482},
  doi = {10.1016/j.knosys.2020.105482}
}
Chen T, Lasserre J-B, Magron V and Pauwels E (2020), "Polynomial Optimization for Bounding Lipschitz Constants of Deep Networks", February, 2020.
Abstract: The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Networks. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constants of a multiple-layer deep neural network. The novelty is to combine a polynomial lifting for ReLU function derivatives with a weak generalization of Putinar's positivity certificate. This idea could also apply to other, nearly sparse, polynomial optimization problems in machine learning. We empirically demonstrate that our method not only runs faster than the state-of-the-art linear programming based method, but also provides sharper bounds.
BibTeX:
@article{Chen2020a,
  author = {Tong Chen and Jean-Bernard Lasserre and Victor Magron and Edouard Pauwels},
  title = {Polynomial Optimization for Bounding Lipschitz Constants of Deep Networks},
  year = {2020}
}
Chen Y, Xiao G, Wu F, Tang Z and Li K (2020), "tpSpMV: A Two-Phase Large-scale Sparse Matrix-Vector Multiplication Kernel for Manycore Architectures", Information Sciences., March, 2020. Elsevier BV.
Abstract: Sparse matrix-vector multiplication (SpMV) is one of the important subroutines in numerical linear algebra, widely used in many large-scale applications. Accelerating SpMV on multicore and manycore architectures based on the Compressed Sparse Row (CSR) format via row-wise parallelization is one of the most popular directions. However, there are three main challenges in optimizing parallel CSR-based SpMV: (a) the limited local memory of each computing unit can be overwhelmed by assignments to long rows of large-scale sparse matrices; (b) irregular accesses to the input vector result in expensive memory access latency; (c) the sparse data structure leads to low bandwidth usage. This paper proposes a two-phase large-scale SpMV, called tpSpMV, based on the memory structure and computing architecture of multicore and manycore architectures to alleviate these three main difficulties. First, we propose the two-phase parallel execution technique for tpSpMV that performs parallel CSR-based SpMV in two separate phases to overcome the computational scale limitation. Second, we respectively propose adaptive partitioning methods and parallelization designs using the local memory caching technique for the two phases to exploit the architectural advantages of high-performance computing platforms and alleviate the problem of high memory access latency. Third, we design several optimizations, such as data reduction, aligned memory accessing, and pipelining, to improve bandwidth usage and optimize tpSpMV's performance. Experimental results on the SW26010 CPUs of the Sunway TaihuLight supercomputer show that tpSpMV achieves speedups of up to 28.61× and yields an average performance improvement of 13.16% over the state-of-the-art work.
BibTeX:
@article{Chen2020b,
  author = {Yuedan Chen and Guoqing Xiao and Fan Wu and Zhuo Tang and Keqin Li},
  title = {tpSpMV: A Two-Phase Large-scale Sparse Matrix-Vector Multiplication Kernel for Manycore Architectures},
  journal = {Information Sciences},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.ins.2020.03.020}
}
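To make the kernel concrete, here is the baseline row-wise CSR SpMV that tpSpMV reorganizes into two phases, as a serial MATLAB sketch (the reference computation only, not the authors' Sunway implementation).
S = sprand(1000, 1000, 0.01); x = randn(1000, 1);
[ci, ri, v] = find(S');                         % entries of S grouped by row
nrows = size(S, 1);
rowptr = cumsum([1; accumarray(ri, 1, [nrows, 1])]);  % CSR row pointers
colidx = ci; val = v;                           % CSR column indices and values
y = zeros(nrows, 1);
for i = 1:nrows                                 % each row is an independent task
    for jj = rowptr(i) : rowptr(i + 1) - 1
        y(i) = y(i) + val(jj) * x(colidx(jj));  % irregular access into x
    end
end
max(abs(y - S * x))                             % check against built-in SpMV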
Chen Y, Xiao G, Ozsu MT, Liu C, Zomaya A and Li T (2020), "aeSpTV: An Adaptive and Efficient Framework for Sparse Tensor-Vector Product Kernel on a High-Performance Computing Platform", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Multi-dimensional, large-scale, and sparse data, which can be neatly represented by sparse tensors, are increasingly used in various applications such as data analysis and machine learning. A high-performance sparse tensor-vector product (SpTV), one of the most fundamental operations for processing sparse tensors, is necessary for improving the efficiency of related applications. In this paper, we propose aeSpTV, an adaptive and efficient SpTV framework on the Sunway TaihuLight supercomputer, to solve several challenges of optimizing SpTV on high-performance computing platforms. First, to map SpTV to the Sunway architecture and tame the expensive memory access latency and parallel writing conflicts due to the intrinsic irregularity of SpTV, we introduce an adaptive SpTV parallelization. Second, to co-execute with the parallelization design while still ensuring high efficiency, we design a sparse tensor data structure named CSSoCR. Third, based on the adaptive SpTV parallelization with the novel tensor data structure, we present an auto-tuner that chooses the most befitting tensor partitioning method for aeSpTV using the variance analysis theory of mathematical statistics to achieve load balance. Fourth, to further leverage the computing power of Sunway, we propose customized optimizations for aeSpTV. Experimental results show that aeSpTV yields good scalability on both thread-level and process-level parallelism of Sunway. It achieves a maximum of 195.69 GFLOPS on 128 processes. Additionally, the partitioning auto-tuner and the optimization techniques are shown to have a remarkable effect.
BibTeX:
@article{Chen2020c,
  author = {Yuedan Chen and Guoqing Xiao and M. Tamer Ozsu and Chubo Liu and Albert Zomaya and Tao Li},
  title = {aeSpTV: An Adaptive and Efficient Framework for Sparse Tensor-Vector Product Kernel on a High-Performance Computing Platform},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--1},
  doi = {10.1109/tpds.2020.2990429}
}
Chen L, Hu X and Wu H (2020), "Randomized Fast Subspace Descent Methods", June, 2020.
Abstract: Randomized Fast Subspace Descent (RFASD) methods are developed and analyzed for smooth, unconstrained convex optimization problems. The efficiency of the method relies on a space decomposition which is stable in the A-norm and for which the condition number κ_A measured in the A-norm is small. At each iteration, the subspace is chosen randomly, either uniformly or with a probability proportional to the local Lipschitz constants. Then, in each chosen subspace, a preconditioned gradient descent method is applied. RFASD converges sublinearly for convex functions and linearly for strongly convex functions. Compared with randomized block coordinate descent methods, the convergence of RFASD is faster provided κ_A is small and the subspace decomposition is A-stable. This improvement is supported by considering a multilevel space decomposition for Nesterov's `worst' problem.
BibTeX:
@article{Chen2020d,
  author = {Long Chen and Xiaozhe Hu and Huiwen Wu},
  title = {Randomized Fast Subspace Descent Methods},
  year = {2020}
}
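A minimal MATLAB sketch of the family RFASD belongs to, minimizing f(x) = x'Ax/2 - b'x by a gradient step in a randomly chosen coordinate block; the A-stable space decomposition and preconditioning that give RFASD its speed are not reproduced here, and the test matrix is an arbitrary well-conditioned choice.
rng('default');
n = 100;
M = randn(n); A = M' * M / n + eye(n);    % well-conditioned SPD test matrix
b = randn(n, 1); x = zeros(n, 1);
blocks = reshape(1:n, [], 10);            % 10 fixed coordinate blocks
for it = 1:2000
    k = randi(10);                        % subspace chosen uniformly at random
    idx = blocks(:, k);
    g = A(idx, :) * x - b(idx);           % gradient restricted to the block
    Lk = norm(A(idx, idx));               % local Lipschitz constant of the block
    x(idx) = x(idx) - g / Lk;             % descent step within the subspace
end
norm(A * x - b)                           % decreases toward zero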
Chen S, Ma S, Xue L and Zou H (2020), "An Alternating Manifold Proximal Gradient Method for Sparse Principal Component Analysis and Sparse Canonical Correlation Analysis", INFORMS Journal on Optimization., July, 2020. , pp. ijoo.2019.0032. Institute for Operations Research and the Management Sciences (INFORMS).
Abstract: Sparse principal component analysis and sparse canonical correlation analysis are two essential techniques from high-dimensional statistics and machine learning for analyzing large-scale data. Both problems can be formulated as an optimization problem with nonsmooth objective and nonconvex constraints. Because nonsmoothness and nonconvexity bring numerical difficulties, most algorithms suggested in the literature either solve some relaxations of them or are heuristic and lack convergence guarantees. In this paper, we propose a new alternating manifold proximal gradient method to solve these two high-dimensional problems and provide a unified convergence analysis. Numerical experimental results are reported to demonstrate the advantages of our algorithm.
BibTeX:
@article{Chen2020e,
  author = {Shixiang Chen and Shiqian Ma and Lingzhou Xue and Hui Zou},
  title = {An Alternating Manifold Proximal Gradient Method for Sparse Principal Component Analysis and Sparse Canonical Correlation Analysis},
  journal = {INFORMS Journal on Optimization},
  publisher = {Institute for Operations Research and the Management Sciences (INFORMS)},
  year = {2020},
  pages = {ijoo.2019.0032},
  doi = {10.1287/ijoo.2019.0032}
}
Chen C, Liang T and Biros G (2020), "RCHOL: Randomized Cholesky Factorization for Solving SDD Linear Systems", November, 2020.
Abstract: We introduce a randomized algorithm, namely rchol, to construct an approximate Cholesky factorization for a given sparse Laplacian matrix (a.k.a. graph Laplacian). The (exact) Cholesky factorization of the matrix introduces a clique in the associated graph after eliminating every row/column. By randomization, rchol samples a subset of the edges in the clique. We prove rchol is breakdown-free and apply it to solving linear systems with symmetric diagonally dominant matrices. In addition, we parallelize rchol based on the nested-dissection ordering for shared-memory machines. Numerical experiments demonstrate the robustness and the scalability of rchol. For example, our parallel code scaled up to 64 threads on a single node for solving the 3D Poisson equation, discretized with the 7-point stencil on a 1024 × 1024 × 1024 grid, i.e., about one billion unknowns.
BibTeX:
@article{Chen2020f,
  author = {Chao Chen and Tianyu Liang and George Biros},
  title = {RCHOL: Randomized Cholesky Factorization for Solving SDD Linear Systems},
  year = {2020}
}
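The target setting is easy to reproduce at small scale. The MATLAB sketch below builds the 7-point-stencil 3D Poisson matrix and solves it with preconditioned CG, using the built-in ichol as a stand-in for the randomized factorization (rchol itself is the authors' code and is not shown; the grid here is tiny compared with the paper's 1024³).
n = 20; e = ones(n, 1);
T = spdiags([-e, 2 * e, -e], -1:1, n, n);    % 1D Laplacian stencil
I = speye(n);
A = kron(kron(T, I), I) + kron(kron(I, T), I) + kron(kron(I, I), T);  % 3D, SDD
b = randn(n^3, 1);
Lfac = ichol(A, struct('type', 'ict', 'droptol', 1e-3));  % approximate Cholesky
x = pcg(A, b, 1e-8, 500, Lfac, Lfac');       % preconditioned conjugate gradients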
Cheng Y, Panigrahi D and Sun K (2020), "Sparsification of Balanced Directed Graphs", June, 2020.
Abstract: Sparsification, where the cut values of an input graph are approximately preserved by a sparse graph (called a cut sparsifier) or a succinct data structure (called a cut sketch), has been an influential tool in graph algorithms. But this tool is restricted to undirected graphs, because some directed graphs are known to not admit sparsification. Such examples, however, are structurally very dissimilar to undirected graphs in that they exhibit highly unbalanced cuts. This motivates us to ask: can we sparsify a balanced digraph? To make this question concrete, we define the balance β of a digraph as the maximum ratio of the cut value in the two directions (Ene et al., STOC 2016). We show the following results: For-All Sparsification: If all cut values need to be simultaneously preserved (cf. Benczúr and Karger, STOC 1996), then we show that the size of the sparsifier (or even cut sketch) must scale linearly with β. The upper bound is a simple extension of sparsification of undirected graphs (formally stated recently in Ikeda and Tanigawa (WAOA 2018)), so our main contribution here is to show a matching lower bound. For-Each Sparsification: If each cut value needs to be individually preserved (Andoni et al., ITCS 2016), then the situation is more interesting. Here, we give a cut sketch whose size scales with √β, thereby beating the linear lower bound above. We also show that this result is tight by exhibiting a matching lower bound of √β on "for-each" cut sketches. Our upper bounds work for general weighted graphs, while the lower bounds hold even for unweighted graphs with no parallel edges.
BibTeX:
@article{Cheng2020,
  author = {Yu Cheng and Debmalya Panigrahi and Kevin Sun},
  title = {Sparsification of Balanced Directed Graphs},
  year = {2020}
}
Chennupati G, Santhi N, Romero P and Eidenbenz S (2020), "Machine Learning Enabled Scalable Performance Prediction of Scientific Codes", October, 2020.
Abstract: We present the Analytical Memory Model with Pipelines (AMMP) of the Performance Prediction Toolkit (PPT). PPT-AMMP takes high-level source code and hardware architecture parameters as input and predicts the runtime of that code on the target hardware platform defined by those parameters. PPT-AMMP transforms the code to an (architecture-independent) intermediate representation, then (i) analyzes the basic block structure of the code, (ii) processes architecture-independent virtual memory access patterns that it uses to build memory reuse distance distribution models for each basic block, and (iii) runs detailed basic-block-level simulations to determine hardware pipeline usage. PPT-AMMP uses machine learning and regression techniques to build the prediction models based on small instances of the input code, then integrates them into a higher-order discrete-event simulation model of PPT running on the Simian PDES engine. We validate PPT-AMMP on four standard computational physics benchmarks, and finally present a use case of hardware parameter sensitivity analysis to identify bottleneck hardware resources for different code inputs. We further extend PPT-AMMP to predict the performance of a scientific application, the radiation transport code SNAP. We analyze the application of multi-variate regression models that accurately predict the reuse profiles and the basic block counts. The predicted runtimes of SNAP are accurate when compared to actual measured times.
BibTeX:
@article{Chennupati2020,
  author = {Gopinath Chennupati and Nandakishore Santhi and Phill Romero and Stephan Eidenbenz},
  title = {Machine Learning Enabled Scalable Performance Prediction of Scientific Codes},
  year = {2020}
}
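Reuse distance is the central quantity in step (ii). The MATLAB sketch below computes it for a toy address trace (the trace is made up): for each access, the distance is the number of distinct addresses touched since the previous access to the same address.
trace = [1 2 3 1 2 1 4 2];              % toy address trace
last = containers.Map('KeyType', 'double', 'ValueType', 'double');
rd = nan(size(trace));                  % NaN marks first-ever accesses
for t = 1:numel(trace)
    a = trace(t);
    if isKey(last, a)
        window = trace(last(a) + 1 : t - 1);
        rd(t) = numel(unique(window));  % distinct addresses in between
    end
    last(a) = t;
end
disp(rd)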
Cheramangalath U, Nasre R and Srikant YN (2020), "Graph Analytics Frameworks", In Distributed Graph Analytics. , pp. 99-122. Springer International Publishing.
Abstract: Frameworks take away the drudgery of routine tasks in programming graph analytic applications. This chapter describes in some detail the different models of execution that are used in graph analytics, such as BSP, Map-Reduce, asynchronous execution, GAS, Inspector-Executor, and Advance-Filter-Compute. It also provides a glimpse of different existing frameworks on multi-core CPUs, GPUs, and distributed systems.
BibTeX:
@incollection{Cheramangalath2020,
  author = {Unnikrishnan Cheramangalath and Rupesh Nasre and Y. N. Srikant},
  title = {Graph Analytics Frameworks},
  booktitle = {Distributed Graph Analytics},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {99--122},
  doi = {10.1007/978-3-030-41886-1_4}
}
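To illustrate the BSP model the chapter opens with, here is a minimal MATLAB sketch of PageRank written in supersteps: every update in a superstep reads only the previous iterate, which is exactly the bulk-synchronous discipline. The graph and damping factor are arbitrary illustrative choices.
n = 5;
G = sparse([1 2 3 4 5 1], [2 3 4 5 1 3], 1, n, n);  % small digraph, no dangling nodes
outdeg = full(sum(G, 2));
P = spdiags(1 ./ outdeg, 0, n, n) * G;              % row-stochastic transition matrix
r = ones(n, 1) / n; d = 0.85;
for superstep = 1:50
    r = (1 - d) / n + d * (P' * r);                 % all reads use the old r
end
disp(r')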
Cherubin S, Cattaneo D, Chiari M and Agosta G (2020), "Dynamic Precision Autotuning with TAFFO", ACM Transactions on Architecture and Code Optimization., May, 2020. Vol. 17(2), pp. 1-26. Association for Computing Machinery (ACM).
Abstract: Many classes of applications, both in the embedded and high performance domains, can trade off the accuracy of the computed results for computation performance. One way to achieve such a trade-off is precision tuning—that is, to modify the data types used for the computation by reducing the bit width, or by changing the representation from floating point to fixed point. We present a methodology for high-accuracy dynamic precision tuning based on the identification of input classes (i.e., classes of input datasets that benefit from similar optimizations). When a new input region is detected, the application kernels are re-compiled on the fly with the appropriate selection of parameters. In this way, we obtain a continuous optimization approach that enables the exploitation of reduced-precision computation while progressively exploring the solution space, thus reducing the time required by compilation overheads. We provide tools to support the automation of the runtime part of the solution, leaving to the user only the task of identifying the input classes. Our approach provides a significant performance boost (up to 320%) on typical approximate computing benchmarks, without meaningfully affecting the accuracy of the result, since the error always remains below 3%.
BibTeX:
@article{Cherubin2020,
  author = {Stefano Cherubin and Daniele Cattaneo and Michele Chiari and Giovanni Agosta},
  title = {Dynamic Precision Autotuning with TAFFO},
  journal = {ACM Transactions on Architecture and Code Optimization},
  publisher = {Association for Computing Machinery (ACM)},
  year = {2020},
  volume = {17},
  number = {2},
  pages = {1--26},
  url = {https://dl.acm.org/doi/pdf/10.1145/3388785},
  doi = {10.1145/3388785}
}
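The underlying trade-off is easy to demonstrate. The MATLAB sketch below times the same kernel in double and in single precision and reports the relative error; this is the kind of accuracy-versus-performance decision TAFFO automates at the compiler level (TAFFO itself rewrites LLVM IR and also targets fixed point, which has no direct MATLAB analogue).
A = randn(2000); B = randn(2000);
tic; Cd = A * B; t_double = toc;                   % reference in double precision
tic; Cs = single(A) * single(B); t_single = toc;   % reduced-precision variant
relerr = norm(double(Cs) - Cd, 'fro') / norm(Cd, 'fro');
fprintf('double: %.3fs  single: %.3fs  rel. error: %.1e\n', t_double, t_single, relerr)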
Chevalier C, Ledoux F and Morais S (2020), "A Multilevel Mesh Partitioning Algorithm Driven by Memory Constraints", In Proceedings of the SIAM Workshop on Combinatorial Scientific Computing., January, 2020. , pp. 85-95. Society for Industrial and Applied Mathematics.
Abstract: Running numerical simulations on HPC architectures requires distributing the data to be processed over the various available processing units. This task is usually done by partitioning tools, whose primary goal is to balance the workload while minimizing inter-process communication. However, they do not take the memory load and memory capacity of the processing units into account. As this can lead to memory overflow, we propose a new approach to mesh partitioning that includes ghost cells in the memory usage and treats memory capacity as a hard constraint to abide by. We model the problem using a bipartite graph and present a new greedy algorithm that aims at producing a partition that respects the memory capacity. This algorithm focuses on memory consumption, and we use it in a multilevel approach to improve the quality of the returned solutions during the refinement phase. The experimental results obtained from our benchmarks show that our approach can yield solutions respecting memory constraints for instances where traditional partitioning tools fail.
BibTeX:
@incollection{Chevalier2020,
  author = {Cédric Chevalier and Franck Ledoux and Sébastien Morais},
  title = {A Multilevel Mesh Partitioning Algorithm Driven by Memory Constraints},
  booktitle = {Proceedings of the SIAM Workshop on Combinatorial Scientific Computing},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  pages = {85--95},
  doi = {10.1137/1.9781611976229.9}
}
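A toy version of the greedy idea, under strong simplifying assumptions (scalar memory loads per cell, no ghost cells and no bipartite-graph modelling as in the paper): assign cells to the least-loaded part while enforcing a hard memory cap.
mem = [4 3 3 2 2 2 1 1];                 % per-cell memory loads (made up)
cap = 7; nparts = 3;
used = zeros(1, nparts); part = zeros(size(mem));
[~, order] = sort(mem, 'descend');       % place the heaviest cells first
for c = order
    [~, p] = min(used);                  % least-loaded part is the best candidate
    if used(p) + mem(c) > cap            % if it overflows, every part does
        error('no partition respects the memory capacity');
    end
    used(p) = used(p) + mem(c);
    part(c) = p;
end
disp(used)                               % per-part memory, all within the cap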
Chieu NH, Hien LV and Trang NTQ (2020), "Tilt Stability for Quadratic Programs with One or Two Quadratic Inequality Constraints", Acta Mathematica Vietnamica., June, 2020. Vol. 45(2), pp. 477-499. Springer Science and Business Media LLC.
Abstract: This paper examines tilt stability for quadratic programs with one or two quadratic inequality constraints. Exploiting specific features of these problems and using some known results on tilt stability in nonlinear programming, we establish quite simple characterizations of tilt-stable local minimizers for quadratic programs with one quadratic inequality constraint under the metric subregularity constraint qualification. In the same way, we also derive various tilt stability conditions for quadratic programs with two quadratic inequality constraints satisfying certain suitable assumptions. In particular, the obtained results show that some tilt stability conditions only known to be sufficient in nonlinear programming become necessary when the considered problems are quadratic programs with one or two quadratic inequality constraints.
BibTeX:
@article{Chieu2020,
  author = {Nguyen Huy Chieu and Le Van Hien and Nguyen Thi Quynh Trang},
  title = {Tilt Stability for Quadratic Programs with One or Two Quadratic Inequality Constraints},
  journal = {Acta Mathematica Vietnamica},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {45},
  number = {2},
  pages = {477--499},
  doi = {10.1007/s40306-020-00372-4}
}
Chilukuri A, Milthorpe J and Johnston B (2020), "Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features", March, 2020.
Abstract: High-performance computing developers are faced with the challenge of optimizing the performance of OpenCL workloads on diverse architectures. The Architecture-Independent Workload Characterization (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program performance on an arbitrary given hardware architecture. However, AIWC metrics are not always easily interpreted and do not reflect some important memory access patterns affecting efficiency across architectures. We propose a new metric of parallel spatial locality -- the closeness of memory accesses simultaneously issued by OpenCL work-items (threads). We implement the parallel spatial locality metric in the AIWC framework, and analyse gathered results on matrix multiply and the Extended OpenDwarfs OpenCL benchmarks. The differences in the observed parallel spatial locality metric across implementations of matrix multiply reflect the optimizations performed. The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures. The improvements suggested to AIWC will help HPC developers better understand memory access patterns of complex codes and guide optimization of codes for arbitrary hardware targets.
BibTeX:
@article{Chilukuri2020,
  author = {Aditya Chilukuri and Josh Milthorpe and Beau Johnston},
  title = {Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features},
  year = {2020}
}
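The flavour of such a metric can be sketched as follows; the scoring formula is our own illustrative assumption, not the AIWC definition. Take the addresses issued simultaneously by a group of work-items and score how tightly they cluster.
addrs = [0:15; 0:64:15*64];              % coalesced vs strided access patterns
for p = 1:2
    a = sort(addrs(p, :));
    gaps = diff(a);                      % distances between neighbouring accesses
    score = mean(1 ./ max(gaps, 1));     % tight clustering scores close to 1
    fprintf('pattern %d: locality score %.3f\n', p, score);
end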
Choi J, Richards DF and Kale LV (2020), "Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems", In Proceedings of the IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware.
Abstract: The landscape of high performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, the ability to tolerate communication latencies and maximize utilization of the compute hardware are becoming increasingly important in achieving high performance. Overdecomposition has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on application performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems. In this work, we address the challenges in achieving computation-communication overlap with overdecomposition on GPU systems using the Charm++ parallel programming system. By prioritizing communication with CUDA streams in the application and supporting asynchronous progress of GPU operations in the Charm++ runtime system, we obtain improvements in overall performance of up to 50% and 47% with proxy applications Jacobi3D and MiniMD, respectively.
BibTeX:
@inproceedings{Choi2020,
  author = {Jaemin Choi and David F. Richards and Laxmikant V. Kale},
  title = {Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems},
  booktitle = {Proceedings of the IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware},
  year = {2020}
}
Abstract: The landscape of high performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, the ability to tolerate communication latencies and maximize utilization of the compute hardware are becoming increasingly important in achieving high performance. Overdecomposition has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on application performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems. In this work, we address the challenges in achieving computation-communication overlap with overdecomposition on GPU systems using the Charm++ parallel programming system. By prioritizing communication with CUDA streams in the application and supporting asynchronous progress of GPU operations in the Charm++ runtime system, we obtain improvements in overall performance of up to 50% and 47% with