Numerical algorithms, optimisation, and computer performance: a selection of research from 2020

I won’t bore you with clichés about how weird and difficult this year has been. Instead, I’d like to point out straightaway how large this year’s selection of papers is: more than 400 papers, more than twice as many as I selected last year.

This year, I haven’t changed my Google Scholar query:

algorithm sparse method parallel performance matrix numerical optimization graph

but I have added some supplemental papers, as I saw fit.

In the coming year, I expect to add some papers about Computational Sustainability, a field I’m getting very interested in.

An analysis of the data

Load data and fit LDA model to find the topics.
rng('default');
textData = string(fileread("bibliography.txt"));
textData = splitlines(textData);
textData = iCleanBibliographyText(textData);
documents = iPrepareDocument(textData);
[bagTraining, docsTest] = iSplitTrainingTest(documents, 0.8);
Train a number of models, changing the number of topics.
allNumTopics = 2:20;
sizeAllNumTopics = numel(allNumTopics);
mdl = cell(sizeAllNumTopics, 1);
logProbabilities = cell(sizeAllNumTopics, 1);
meanLogProbabilities = zeros(sizeAllNumTopics, 1);
perplexities = cell(sizeAllNumTopics, 1);
meanPerplexities = zeros(sizeAllNumTopics, 1);
for k = 1:sizeAllNumTopics
rng('default');
numTopics = allNumTopics(k);
mdl{k} = fitlda(bagTraining, numTopics, 'Verbose', 0);
logProbabilities{k} = logp(mdl{k}, bagTraining, 'NumSamples', 200);
[~, perplexities{k}] = logp(mdl{k}, docsTest, 'NumSamples', 200);
meanLogProbabilities(k) = mean(logProbabilities{k});
meanPerplexities(k) = mean(perplexities{k});
end
Compare the models: plot log-probabilities over the training set and perplexities over the test set.
figure;
sgtitle("Model comparison");
subplot(1, 2, 1);
plot(allNumTopics, meanLogProbabilities);
ylabel("Log-probability on the training set");
xlabel("Number of topics");
subplot(1, 2, 2);
plot(allNumTopics, meanPerplexities);
ylabel("Perplexities on the training set");
xlabel("Number of topics");
Pick the model with the best performance on the test set.
[~, idxBestModel] = min(meanPerplexities);
bestModel = mdl{idxBestModel};
bestNumTopics = allNumTopics(idxBestModel);
iDisplayTopics(bestModel);
Topic 1: problem solution solve show order case find apply introduce approximate
Topic 2: propose base many number different computational cost set good size
Topic 3: numerical method first experiment accuracy fast novel two accelerate compare
Topic 4: datum performance provide new require execution challenge various process feature
Topic 5: model optimization learn structure network machine parameter stateoftheart present largescale
Topic 6: implementation compute performance achieve gpu gpus test technique architecture design
Topic 7: method problem gradient iteration complexity analysis condition optimization exist particular
Topic 8: matrix memory parallel block communication distribute computation partition sparse datum
Topic 9: application program code analysis performance include study hpc framework exist
Topic 10: system solver large linear equation low nonlinear result example lead
Topic 11: algorithm result work improve propose term vector constraint version mean
Topic 12: approach time reduce work speedup well three implement experimental balance
Topic 13: algorithm sparse matrix linear parallel present factorization demonstrate iterative direct
Topic 14: library system paper scientific software hardware simulation highperformance spmv application
Topic 15: tensor result tool paper show two decomposition factor guarantee input
Topic 16: convergence function point convex rate global stochastic local bound class
Topic 17: error compute precision runtime operation scale algebra computation overhead arithmetic
Topic 18: graph cluster algorithm spectral random nod sequence edge quadratic match
iDisplayLogProbability(logProbabilities{idxBestModel});
Show how the best model categorises the documents in the test set.
topicMixture = transform(bestModel, docsTest);
iDisplayTopicMixtures(topicMixture, bestNumTopics);
Compare the topics found by the model
allTopicMixtures = transform(bestModel, documents);
iDisplayCloseness(allTopicMixtures);

Helper functions

function cleanDocuments = iCleanBibliographyText(document)
% Clean the input array of strings, making sure each string
% corresponds to a bibliography entry.
stringSizes = strlength(document);
document(stringSizes==0) = [];
isAbstract = contains(document, "Abstract:");
idxAbstract = find(isAbstract);
% The previous abstract ends two rows before the next one.
idxEndAbstract = [idxAbstract-2; numel(document)];
idxEndAbstract(1) = [];
cleanDocuments = repmat("", [numel(idxAbstract), 1]);
% Select each abstract
for k = 1:numel(idxAbstract)
currAbstractStart = idxAbstract(k);
currAbstractEnd = idxEndAbstract(k);
cleanDocuments(k) = join(document(currAbstractStart:currAbstractEnd));
end
cleanDocuments(strlength(cleanDocuments)==0 | ismissing(cleanDocuments)) = [];
end
function documents = iPrepareDocument(textData)
% Prepare the documents for analysis
documents = tokenizedDocument(textData);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents, 2);
documents = removeWords(documents, ["Abstract", "2020", "arXiv"]);
documents = normalizeWords(documents, 'Style', 'lemma');
end
function iDisplayTopics(mdl)
% Display the topics in the model
figure
numTopics = mdl.NumTopics;
for idxTopic = 1:numTopics
topicWords = join(iFindTopicWords(idxTopic), " ");
disp("Topic " + idxTopic + ": " + topicWords);
end
function words = iFindTopicWords(idxTopic)
wordProbabilities = mdl.TopicWordProbabilities(:, idxTopic);
wordList = mdl.Vocabulary;
[~, idxTopProbabilities] = maxk(wordProbabilities, 10);
words = wordList(idxTopProbabilities);
end
end
function iDisplayTopicMixtures(topicMixtures, numTopics)
% Display the probability of topics by document
figure
area(topicMixtures);
ylim([0 1])
xlim([0, size(topicMixtures, 1)]);
title("Topic mixtures over the test set")
ylabel("Topic probability")
xlabel("Document number")
legend("Topic " + string(1:numTopics),'Location','northeastoutside')
end
function iDisplayLogProbability(logProbabilities)
% Display a log-probability diagram
figure
histogram(logProbabilities)
xlabel("Log probability")
ylabel("Frequency")
title("Document log-probabilities")
end
function iDisplayCloseness(allTopicMixtures)
% Display the cosine closeness between topics
topicNorms = vecnorm(allTopicMixtures, 2);
numTopics = numel(topicNorms);
topicCloseness = (allTopicMixtures')*(allTopicMixtures)./((topicNorms')*topicNorms);
figure
currCol = pcolor(topicCloseness);
colorbar('location', 'EastOutside');
xlabel("Topic number");
ylabel("Topic number");
title("Closeness between topics");
plotAxis = currCol.Parent;
plotAxis.XTick = 1:numTopics;
plotAxis.YTick = 1:numTopics;
plotAxis.YDir = 'reverse';
pbaspect(plotAxis, [1, 1, 1]);
end
function [bagTraining, docsTest] = iSplitTrainingTest(documents, ratio)
% Split training set and test set
numDocuments = numel(documents);
numTraining = round(ratio*numDocuments);
documents = documents(randperm(numDocuments));
bagTraining = bagOfWords(documents(1:numTraining));
docsTest = documents(numTraining+1:end);
end

From the data analysis, I see:

Abdelfattah A, Tomov S and Dongarra J (2020), "Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices Using GPUs", In Lecture Notes in Computer Science. , pp. 237-250. Springer International Publishing.
Abstract: Half-precision computation refers to performing floating-point operations in a 16-bit format. While half-precision has been driven largely by machine learning applications, recent algorithmic advances in numerical linear algebra have discovered beneficial use cases for half precision in accelerating the solution of linear systems of equations at higher precisions. In this paper, we present a high-performance, mixed-precision linear solver (Ax=b) for symmetric positive definite systems in double-precision using graphics processing units (GPUs). The solver is based on a mixed-precision Cholesky factorization that utilizes the high-performance tensor core units in CUDA-enabled GPUs. Since the Cholesky factors are affected by the low precision, an iterative refinement (IR) solver is required to recover the solution back to double-precision accuracy. Two different types of IR solvers are discussed on a wide range of test matrices. A preprocessing step is also developed, which scales and shifts the matrix, if necessary, in order to preserve its positive-definiteness in lower precisions. Our experiments on the V100 GPU show that performance speedups are up to 4.7× against a direct double-precision solver. However, matrix properties such as the condition number and the eigenvalue distribution can affect the convergence rate, which would consequently affect the overall performance.
BibTeX:
@incollection{Abdelfattah2020,
  author = {Ahmad Abdelfattah and Stan Tomov and Jack Dongarra},
  title = {Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices Using GPUs},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {237--250},
  doi = {10.1007/978-3-030-50417-5_18}
}
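
To make the mixed-precision idea concrete, here is a tiny iterative-refinement sketch of my own (plain single precision on the CPU, nothing like the authors' FP16 tensor-core solver): factorize in low precision, then recover double-precision accuracy by refinement.
% Illustrative only: low-precision Cholesky factorization + iterative refinement.
n = 500;
A = full(sprandsym(n, 0.01, 0.1, 1));       % random symmetric positive definite matrix
b = randn(n, 1);
R = chol(single(A));                        % factorization in low (single) precision
x = double(R \ (R' \ single(b)));           % initial solution from the low-precision factors
for iter = 1:10
    r = b - A*x;                            % residual computed in double precision
    if norm(r) <= 1e-12*norm(b), break, end
    x = x + double(R \ (R' \ single(r)));   % correction from the low-precision factors
end
fprintf("Relative residual after refinement: %.2e\n", norm(b - A*x)/norm(b));
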
Abdelfattah A, Tomov S and Dongarra J (2020), "Matrix multiplication on batches of small matrices in half and half-complex precisions", Journal of Parallel and Distributed Computing., 11, 2020. Vol. 145, pp. 188-201. Elsevier BV.
Abstract: Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations -- in particular general matrix-matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also very important for dense linear algebra algorithms, and half-precision GEMM operations can be used in mixed-precision linear solvers. Therefore, high-performance batched GEMM operations in reduced precision are significantly important, not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers. This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic. The paper addresses both real and complex half-precision computations on the GPU. The proposed design takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, performance speedups are observed against cuBLAS for sizes up to 128, and range between 1.5× and 2.5×. For the complex FP16 GEMM kernel, the speedups are between 1.7× and 7× thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor's solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
BibTeX:
@article{Abdelfattah2020b,
  author = {Ahmad Abdelfattah and Stanimire Tomov and Jack Dongarra},
  title = {Matrix multiplication on batches of small matrices in half and half-complex precisions},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {145},
  pages = {188--201},
  doi = {10.1016/j.jpdc.2020.07.001}
}
Acebrón JA (2020), "A Probabilistic Linear Solver Based on a Multilevel Monte Carlo Method", Journal of Scientific Computing., 2, 2020. Vol. 82(3) Springer Science and Business Media LLC.
Abstract: We describe a new Monte Carlo method based on a multilevel method for computing the action of the resolvent matrix over a vector. The method is based on the numerical evaluation of the Laplace transform of the matrix exponential, which is computed efficiently using a multilevel Monte Carlo method. Essentially, it requires generating suitable random paths which evolve through the indices of the matrix according to the probability law of a continuous-time Markov chain governed by the associated Laplacian matrix. The convergence of the proposed multilevel method has been discussed, and several numerical examples were run to test the performance of the algorithm. These examples concern the computation of some metrics of interest in the analysis of complex networks, and the numerical solution of a boundary-value problem for an elliptic partial differential equation. In addition, the algorithm was conveniently parallelized, and the scalability analyzed and compared with the results of other existing Monte Carlo method for solving linear algebra systems.
BibTeX:
@article{Acebron2020,
  author = {Juan A. Acebrón},
  title = {A Probabilistic Linear Solver Based on a Multilevel Monte Carlo Method},
  journal = {Journal of Scientific Computing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {82},
  number = {3},
  doi = {10.1007/s10915-020-01168-2}
}
Acer S and Aykanat C (2020), "Reordering sparse matrices into block-diagonal column-overlapped form", Journal of Parallel and Distributed Computing., 3, 2020. Elsevier BV.
Abstract: Many scientific and engineering applications necessitate computing the minimum norm solution of a sparse underdetermined linear system of equations. The minimum 2-norm solution of such systems can be obtained by a recent parallel algorithm, whose numerical effectiveness and parallel scalability are validated in both shared- and distributed-memory architectures. This parallel algorithm assumes the coefficient matrix in a block-diagonal column-overlapped (BDCO) form, which is a variant of the block-diagonal form where the successive diagonal blocks may overlap along their columns. The total overlap size of the BDCO form is an important metric in the performance of the subject parallel algorithm since it determines the size of the reduced system, solution of which is a bottleneck operation in the parallel algorithm. In this work, we propose a hypergraph partitioning model for reordering sparse matrices into BDCO form with the objective of minimizing the total overlap size and the constraint of maintaining balance on the number of nonzeros of the diagonal blocks. Our model makes use of existing partitioning tools that support fixed vertices in the recursive bipartitioning paradigm. Experimental results validate the use of our model as it achieves small overlap size and balanced diagonal blocks.
BibTeX:
@article{Acer2020,
  author = {Seher Acer and Cevdet Aykanat},
  title = {Reordering sparse matrices into block-diagonal column-overlapped form},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.jpdc.2020.03.002}
}
Aggarwal CC (2020), "The Linear Algebra of Similarity", In Linear Algebra and Optimization for Machine Learning. , pp. 379-410. Springer International Publishing.
Abstract: A dot-product similarity matrix is an alternative way to represent a multidimensional data set. In other words, one can convert an n × d data matrix D into an n × n similarity matrix S = DD^T (which contains n^2 pairwise dot products between points). One can use S instead of D for machine learning algorithms. The reason is that the similarity matrix contains almost the same information about the data as the original matrix. This equivalence is the genesis of a large class of methods in machine learning, referred to as kernel methods. This chapter builds the linear algebra framework required for understanding this important class of methods in machine learning. The real utility of such methods arises when the similarity matrix is chosen differently from the use of dot products (and the data matrix is sometimes not even available).
BibTeX:
@incollection{Aggarwal2020,
  author = {Charu C. Aggarwal},
  title = {The Linear Algebra of Similarity},
  booktitle = {Linear Algebra and Optimization for Machine Learning},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {379--410},
  doi = {10.1007/978-3-030-40344-7_9}
}
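
The chapter's starting point is easy to reproduce in a couple of lines; a toy example of mine, contrasting the dot-product similarity matrix with a kernel matrix:
% Illustrative only: build an n-by-n dot-product similarity matrix S = D*D'
% from an n-by-d data matrix D, and compare it with a Gaussian (RBF) kernel.
rng(0);
n = 6; d = 3;
D = randn(n, d);                      % data matrix, one observation per row
S = D*D';                             % pairwise dot products, n-by-n
% A kernel matrix replaces the dot product with another similarity measure.
sigma = 1;
sqDist = sum(D.^2, 2) + sum(D.^2, 2)' - 2*(D*D');
K = exp(-sqDist/(2*sigma^2));         % Gaussian kernel matrix
disp(S); disp(K);
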
Agullo E, Cools S, Fatih-Yetkin E, Giraud L, Schenkel N and Vanroose W (2020), "On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection". Thesis at: INRIA.
Abstract: The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than sixty year old, it is still a serious candidate for extreme-scale computation on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices affect dramatically their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset which consists in a bit-flip in a memory cell producing unexpected results at application level. Consequently, the future computing facilities at extreme scale might be more prone to errors of any kind including bit-flip during calculation. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such soft errors; we assess their robustness through extensive numerical experiments.
BibTeX:
@techreport{Agullo2020,
  author = {Emmanuel Agullo and Siegfried Cools and Emrullah Fatih-Yetkin and Luc Giraud and Nick Schenkel and Wim Vanroose},
  title = {On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection},
  school = {INRIA},
  year = {2020},
  url = {https://hal.inria.fr/hal-02495301}
}
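
As a toy version of the idea, one can compare the recursively updated CG residual with the explicitly recomputed one; a corruption injected into the matrix-vector product makes them drift apart. This is my own sketch, not the detection criteria proposed in the report:
% Illustrative only: plain CG with a check of the recursive residual against
% the explicitly recomputed residual b - A*x, a crude way to notice a corrupted matvec.
A = gallery('poisson', 14);                      % SPD test matrix (196-by-196)
n = size(A, 1);
b = ones(n, 1);
x = zeros(n, 1);
r = b; p = r; rho = r'*r;
for k = 1:500
    q = A*p;
    if k == 20
        q(10) = q(10) + 1;                       % inject a soft error into the matvec
    end
    alpha = rho/(p'*q);
    x = x + alpha*p;
    r = r - alpha*q;
    if norm((b - A*x) - r) > 1e-6*norm(b)        % recursive vs true residual drift
        fprintf("Possible soft error detected at iteration %d\n", k);
        break
    end
    rhoNew = r'*r;
    if sqrt(rhoNew) <= 1e-10*norm(b), break, end
    p = r + (rhoNew/rho)*p;
    rho = rhoNew;
end
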
Agullo E, Altenbernd M, Anzt H, Bautista-Gomez L, Benacchio T, Bonaventura L, Bungartz H-J, Chatterjee S, Ciorba FM, DeBardeleben N, Drzisga D, Eibl S, Engelmann C, Gansterer WN, Giraud L, Goeddeke D, Heisig M, Jezequel F, Kohl N, Li XS, Lion R, Mehl M, Mycek P, Obersteiner M, Quintana-Orti ES, Rizzi F, Ruede U, Schulz M, Fung F, Speck R, Stals L, Teranishi K, Thibault S, Thoennes D, Wagner A and Wohlmuth B (2020), "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations", October, 2020.
Abstract: This work is based on the seminar titled ``Resiliency in Numerical Algorithm Design for Extreme Scale Simulations'' held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.
BibTeX:
@article{Agullo2020a,
  author = {Emmanuel Agullo and Mirco Altenbernd and Hartwig Anzt and Leonardo Bautista-Gomez and Tommaso Benacchio and Luca Bonaventura and Hans-Joachim Bungartz and Sanjay Chatterjee and Florina M. Ciorba and Nathan DeBardeleben and Daniel Drzisga and Sebastian Eibl and Christian Engelmann and Wilfried N. Gansterer and Luc Giraud and Dominik Goeddeke and Marco Heisig and Fabienne Jezequel and Nils Kohl and Xiaoye Sherry Li and Romain Lion and Miriam Mehl and Paul Mycek and Michael Obersteiner and Enrique S. Quintana-Orti and Francesco Rizzi and Ulrich Ruede and Martin Schulz and Fred Fung and Robert Speck and Linda Stals and Keita Teranishi and Samuel Thibault and Dominik Thoennes and Andreas Wagner and Barbara Wohlmuth},
  title = {Resiliency in Numerical Algorithm Design for Extreme Scale Simulations},
  year = {2020}
}
Ahmad N, Yilmaz B and Unat D (2020), "A Prediction Framework for Fast Sparse Triangular Solves", In Euro-Par 2020: Parallel Processing. , pp. 529-545. Springer International Publishing.
Abstract: Sparse triangular solve (SpTRSV) is an important linear algebra kernel, finding extensive uses in numerical and scientific computing. The parallel implementation of SpTRSV is a challenging task due to the sequential nature of the steps involved. This makes it, in many cases, one of the most time-consuming operations in an application. Many approaches for efficient SpTRSV on CPU and GPU systems have been proposed in the literature. However, no single implementation or platform (CPU or GPU) gives the fastest solution for all input sparse matrices. In this work, we propose a machine learning-based framework to predict the SpTRSV implementation giving the fastest execution time for a given sparse matrix based on its structural features. The framework is tested with six SpTRSV implementations on a state-of-the-art CPU-GPU machine (Intel Xeon Gold CPU, NVIDIA V100 GPU). Experimental results, with 998 matrices taken from the SuiteSparse Matrix Collection, show the classifier prediction accuracy of 87% for the fastest SpTRSV algorithm for a given input matrix. Predicted SpTRSV implementations achieve average speedups (harmonic mean) in the range of 1.4--2.7× against the six SpTRSV implementations used in the evaluation.
BibTeX:
@incollection{Ahmad2020,
  author = {Najeeb Ahmad and Buse Yilmaz and Didem Unat},
  title = {A Prediction Framework for Fast Sparse Triangular Solves},
  booktitle = {Euro-Par 2020: Parallel Processing},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {529--545},
  doi = {10.1007/978-3-030-57675-2_33}
}
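
For context, the kernel being predicted is sparse forward substitution; a minimal row-oriented sketch of mine (real SpTRSV implementations use level scheduling and GPU parallelism):
% Illustrative only: row-oriented sparse lower-triangular solve L*x = b.
rng(0);
n = 1000;
L = speye(n) + tril(sprandn(n, n, 0.01), -1)/10;   % unit lower-triangular sparse matrix
b = randn(n, 1);
x = zeros(n, 1);
for i = 1:n
    x(i) = (b(i) - L(i, 1:i-1)*x(1:i-1))/L(i, i);
end
fprintf("Residual of the triangular solve: %.2e\n", norm(L*x - b));
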
Ahmadi AA and Zhang J (2020), "Complexity aspects of local minima and related notions", August, 2020.
Abstract: We consider the notions of (i) critical points, (ii) second-order points, (iii) local minima, and (iv) strict local minima for multivariate polynomials. For each type of point, and as a function of the degree of the polynomial, we study the complexity of deciding (1) if a given point is of that type, and (2) if a polynomial has a point of that type. Our results characterize the complexity of these two questions for all degrees left open by prior literature. Our main contributions reveal that many of these questions turn out to be tractable for cubic polynomials. In particular, we present an efficiently-checkable necessary and sufficient condition for local minimality of a point for a cubic polynomial. We also show that a local minimum of a cubic polynomial can be efficiently found by solving semidefinite programs of size linear in the number of variables. By contrast, we show that it is strongly NP-hard to decide if a cubic polynomial has a critical point. We also prove that the set of second-order points of any cubic polynomial is a spectrahedron, and conversely that any spectrahedron is the projection of the set of second-order points of a cubic polynomial. In our final section, we briefly present a potential application of finding local minima of cubic polynomials to the design of a third-order Newton method.
BibTeX:
@article{Ahmadi2020,
  author = {Amir Ali Ahmadi and Jeffrey Zhang},
  title = {Complexity aspects of local minima and related notions},
  year = {2020}
}
Ahrens P, Demmel J and Nguyen H (2020), "Algorithms for Efficient Reproducible Floating Point Summation", ACM Transactions on Mathematical Software. New York, NY, USA Vol. 0(ja) Association for Computing Machinery.
Abstract: We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, with nonassociativity of floating-point addition, makes reproducibility a challenge even for summing a vector of numbers, or the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible floating point sum, independent of summation order. The algorithm uses only a subset of IEEE Standard 754-2008. It is communication-optimal, i.e. it does just one pass over the data in the sequential case, or one reduction operation in parallel, requiring an accumulator of just 6 words (higher precision is possible). The arithmetic cost is 7n additions to sum n words, and the error bound can be up to 10^(-8) times smaller than for conventional summation. We describe the algorithm, the software infrastructure for reproducible BLAS (ReproBLAS), and performance results. For example, for the dot product of 4096 doubles, we get a 4× slowdown compared to Intel MKL on an Intel Core i7-2600 CPU operating at 3.4 GHz and 256 KB L2 Cache.
BibTeX:
@article{Ahrens2020,
  author = {Ahrens, Peter and Demmel, James and Nguyen, Hong},
  title = {Algorithms for Efficient Reproducible Floating Point Summation},
  journal = {ACM Transactions on Mathematical Software},
  publisher = {Association for Computing Machinery},
  year = {2020},
  volume = {0},
  number = {ja},
  url = {https://dl.acm.org/doi/abs/10.1145/3389360},
  doi = {10.1145/3389360}
}
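
The root of the reproducibility problem is that floating-point addition is not associative, so the summation order changes the result; a two-minute demonstration of mine:
% Illustrative only: the same numbers summed in different orders typically give
% different floating-point results, which is why dynamic scheduling breaks
% bitwise reproducibility.
rng(1);
v = [1e16; randn(1e5, 1); -1e16];
s1 = sum(v);                  % natural order
s2 = sum(v(end:-1:1));        % reversed order
s3 = sum(sort(v));            % ascending order
fprintf("%.17g\n%.17g\n%.17g\n", s1, s2, s3);
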
Ahrens P and Boman EG (2020), "On Optimal Partitioning For Sparse Matrices In Variable Block Row Format", May, 2020.
Abstract: The Variable Block Row (VBR) format is an influential blocked sparse matrix format designed to represent shared sparsity structure between adjacent rows and columns. VBR consists of groups of adjacent rows and columns, storing the resulting blocks that contain nonzeros in a dense format. This reduces the memory footprint and enables optimizations such as register blocking and instruction-level parallelism. Existing approaches use heuristics to determine which rows and columns should be grouped together. We adapt and optimize a dynamic programming algorithm for sequential hypergraph partitioning to produce a linear time algorithm which can determine the optimal partition of rows under an expressive cost model, assuming the column partition remains fixed. Furthermore, we show that the problem of determining an optimal partition for the rows and columns simultaneously is NP-Hard under a simple linear cost model. To evaluate our algorithm empirically against existing heuristics, we introduce the 1D-VBR format, a specialization of VBR format where columns are left ungrouped. We evaluate our algorithms on all 1626 real-valued matrices in the SuiteSparse Matrix Collection. When asked to minimize an empirically derived cost model for a sparse matrix-vector multiplication kernel, our algorithm produced partitions whose 1D-VBR realizations achieve a speedup of at least 1.18 over an unblocked kernel on 25% of the matrices, and a speedup of at least 1.59 on 12.5% of the matrices. The 1D-VBR representation produced by our algorithm had faster SpMVs than the 1D-VBR representations produced by any existing heuristics on 87.8% of the test matrices.
BibTeX:
@article{Ahrens2020a,
  author = {Peter Ahrens and Erik G. Boman},
  title = {On Optimal Partitioning For Sparse Matrices In Variable Block Row Format},
  year = {2020}
}
Ahrens P (2020), "Load Plus Communication Balancing in Contiguous Partitions for Distributed Sparse Matrices: Linear-Time Algorithms", July, 2020.
Abstract: We study partitioning to parallelize multiplication of one or more dense vectors by a sparse matrix (SpMV or SpMM). We consider contiguous partitions, where the rows (or columns) of the matrix are split into K parts without reordering. We present exact and approximate contiguous partitioning algorithms that minimize the runtime of the longest-running processor under cost models that combine work factors and hypergraph communication factors. This differs from traditional graph or hypergraph partitioning models which minimize total communication under a work balance constraint. We address regimes where partitions of the row space and column space are expected to match (the symmetric case) or are allowed to differ (the nonsymmetric case). Our algorithms use linear space. Our exact algorithm runs in linear time when K^2 is sublinear. Our (1 + ε)-approximate algorithm runs in linear time when K(1/ε) is sublinear. We combine concepts from high-performance computing and computational geometry. Existing load balancing algorithms optimize a linear model of per-processor work. We make minor adaptations to optimize arbitrary nonuniform monotonic increasing or decreasing cost functions which may be expensive to evaluate. We then show that evaluating our model of communication is equivalent to planar dominance counting. We specialize Chazelle's dominance counting algorithm to points in the bounded integer plane and generalize it to trade reduced construction time for increased query time, since our partitioners make very few queries. Our algorithms split the original row (or column) ordering into parts to optimize diverse cost models. Combined with reordering or embedding techniques, our algorithms might be used to build more general heuristic partitioners, as they can optimally round one-dimensional embeddings of direct K-way noncontiguous partitioning problems.
BibTeX:
@article{Ahrens2020b,
  author = {Peter Ahrens},
  title = {Load Plus Communication Balancing in Contiguous Partitions for Distributed Sparse Matrices: Linear-Time Algorithms},
  year = {2020}
}
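
A much cruder relative of these partitioners is a greedy contiguous split of rows that balances nonzero counts; my own sketch, not the paper's optimal algorithm:
% Illustrative only: split the rows of a sparse matrix into K contiguous parts
% with roughly equal nonzero counts, a crude stand-in for the paper's cost models.
rng(0);
A = sprandn(10000, 10000, 1e-3);
K = 8;
nnzPerRow = full(sum(A ~= 0, 2));
cumNnz = cumsum(nnzPerRow);
splits = zeros(K+1, 1);
splits(K+1) = size(A, 1);
for p = 1:K-1
    % last row of part p: first row where the cumulative count reaches p/K of the total
    splits(p+1) = find(cumNnz >= (p/K)*cumNnz(end), 1, 'first');
end
partNnz = diff([0; cumNnz(splits(2:end))]);   % nonzeros assigned to each of the K parts
disp(partNnz');
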
AlAhmadi S, Muhammed T, Mehmood R and Albeshri A (2020), "Performance Characteristics for Sparse Matrix-Vector Multiplication on GPUs", In Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies. Cham , pp. 409-426. Springer International Publishing.
Abstract: The massive parallelism provided by the graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. One such application is Sparse Matrix-Vector (SpMV) multiplication, which is an essential building block for numerous scientific and engineering applications. Researchers who propose new storage techniques for sparse matrix-vector multiplication focus mainly on a single evaluation metrics or performance characteristics which is usually the throughput performance of sparse matrix-vector multiplication in FLOPS. However, such an evaluation does not provide a deeper insight nor allow to compare new SpMV techniques with their competitors directly. In this chapter, we explain the notable performance characteristics of the GPU architectures and SpMV computations. We discuss various strategies to improve the performance of SpMV on GPUs. We also discuss a few performance criteria that are usually overlooked by the researchers during the evaluation process. We also analyze various well-known schemes such as COO, CSR, ELL, DIA, HYB, and CSR5 using the discussed performance characteristics.
BibTeX:
@inbook{AlAhmadi2020,
  author = {AlAhmadi, Sarah and Muhammed, Thaha and Mehmood, Rashid and Albeshri, Aiiad},
  editor = {Mehmood, Rashid and See, Simon and Katib, Iyad and Chlamtac, Imrich},
  title = {Performance Characteristics for Sparse Matrix-Vector Multiplication on GPUs},
  booktitle = {Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {409--426},
  doi = {10.1007/978-3-030-13705-2_17}
}
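
As background for the formats compared in the chapter, here is a scalar CSR SpMV sketch of mine (MATLAB itself stores sparse matrices in compressed sparse column format, so the CSR arrays are built explicitly):
% Illustrative only: sparse matrix-vector product y = A*x with A held in
% CSR arrays (values, column indices, row pointers).
rng(0);
A = sprandn(8, 8, 0.3);
x = randn(8, 1);
% Build the CSR arrays: find(A') returns the entries of A in row-major order.
[colIdx, rowIdx, vals] = find(A');
counts = accumarray(rowIdx, 1, [size(A, 1), 1]);   % nonzeros per row of A
rowPtr = [1; 1 + cumsum(counts)];
y = zeros(size(A, 1), 1);
for i = 1:size(A, 1)
    for k = rowPtr(i):rowPtr(i+1)-1
        y(i) = y(i) + vals(k)*x(colIdx(k));
    end
end
fprintf("Max error vs built-in SpMV: %.2e\n", max(abs(y - A*x)));
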
Alappat CL, Alvermann A, Basermann A, Fehske H, Futamura Y, Galgon M, Hager G, Huber S, Imakura A, Kawai M, Kreutzer M, Lang B, Nakajima K, Röhrig-Zöllner M, Sakurai T, Shahzad F, Thies J and Wellein G (2020), "ESSEX: Equipping Sparse Solvers For Exascale", In Software for Exascale Computing -- SPPEXA 2016--2019. , pp. 143-187. Springer International Publishing.
Abstract: The ESSEX project has investigated programming concepts, data structures, and numerical algorithms for scalable, efficient, and robust sparse eigenvalue solvers on future heterogeneous exascale systems. Starting without the burden of legacy code, a holistic performance engineering process could be deployed across the traditional software layers to identify efficient implementations and guide sustainable software development. At the basic building blocks level, a flexible MPI+X programming approach was implemented together with a new sparse data structure (SELL-C-σ) to support heterogeneous architectures by design. Furthermore, ESSEX focused on hardware-efficient kernels for all relevant architectures and efficient data structures for block vector formulations of the eigensolvers. The algorithm layer addressed standard, generalized, and nonlinear eigenvalue problems and provided some widely usable solver implementations including a block Jacobi--Davidson algorithm, contour-based integration schemes, and filter polynomial approaches. Adding to the highly efficient kernel implementations, algorithmic advances such as adaptive precision, optimized filtering coefficients, and preconditioning have further improved time to solution. These developments were guided by quantum physics applications, especially from the field of topological insulator- or graphene-based systems. For these, ScaMaC, a scalable matrix generation framework for a broad set of quantum physics problems, was developed. As the central software core of ESSEX, the PHIST library for sparse systems of linear equations and eigenvalue problems has been established. It abstracts algorithmic developments from low-level optimization. Finally, central ESSEX software components and solvers have demonstrated scalability and hardware efficiency on up to 256 K cores using million-way process/thread-level parallelism.
BibTeX:
@inproceedings{Alappat2020,
  author = {Alappat, Christie L. and Alvermann, Andreas and Basermann, Achim and Fehske, Holger and Futamura, Yasunori and Galgon, Martin and Hager, Georg and Huber, Sarah and Imakura, Akira and Kawai, Masatoshi and Kreutzer, Moritz and Lang, Bruno and Nakajima, Kengo and Röhrig-Zöllner, Melven and Sakurai, Tetsuya and Shahzad, Faisal and Thies, Jonas and Wellein, Gerhard},
  editor = {Bungartz, Hans-Joachim and Reiz, Severin and Uekermann, Benjamin and Neumann, Philipp and Nagel, Wolfgang E.},
  title = {ESSEX: Equipping Sparse Solvers For Exascale},
  booktitle = {Software for Exascale Computing -- SPPEXA 2016--2019},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {143--187}
}
Alappat CL, Alvermann A, Basermann A, Fehske H, Futamura Y, Galgon M, Hager G, Huber S, Imakura A, Kawai M, Kreutzer M, Lang B, Nakajima K, Röhrig-Zöllner M, Sakurai T, Shahzad F, Thies J and Wellein G (2020), "Equipping Sparse Solvers For Exascale", Software for Exascale Computing -- SPPEXA 2016-2019. , pp. 143-187.
Abstract: The ESSEX project has investigated programming concepts, data structures, and numerical algorithms for scalable, efficient, and robust sparse eigenvalue solvers on future heterogeneous exascale systems. Starting without the burden of legacy code, a holistic performance engineering process could be deployed across the traditional software layers to identify efficient implementations and guide sustainable software development. At the basic building blocks level, a flexible MPI+X programming approach was implemented together with a new sparse data structure (SELL-C-σ) to support heterogeneous architectures by design. Furthermore, ESSEX focused on hardware-efficient kernels for all relevant architectures and efficient data structures for block vector formulations of the eigensolvers. The algorithm layer addressed standard, generalized, and nonlinear eigenvalue problems and provided some widely usable solver implementations including a block JacobiDavidson algorithm, contour-based integration schemes, and filter polynomial approaches. Adding to the highly efficient kernel implementations, algorithmic advances such as adaptive precision, optimized filtering coefficients, and preconditioning have further improved time to solution. These developments were guided by the field of quantum physics applications, and especially by current topics such as topological insulator systems or problems from graphene research. For these, ScaMaC, a scalable matrix generation framework for a broad set of quantum physics problems, was developed. As the central software core of ESSEX, the PHIST library for sparse linear and eigenvalue problems has been established. It abstracts algorithmic developments from low-level optimization. Finally, central ESSEX software components and solvers have demonstrated scalability and hardware efficiency on up to 256 K cores using million-way process/thread-level parallelism.
BibTeX:
@article{Alappat2020a,
  author = {Christie L. Alappat and Andreas Alvermann and Achim Basermann and Holger Fehske and Yasunori Futamura and Martin Galgon and Georg Hager and Sarah Huber and Akira Imakura and Masatoshi Kawai and Moritz Kreutzer and Bruno Lang and Kengo Nakajima and Melven Röhrig-Zöllner and Tetsuya Sakurai and Faisal Shahzad and Jonas Thies and Gerhard Wellein},
  title = {Equipping Sparse Solvers For Exascale},
  journal = {Software for Exascale Computing -- SPPEXA 2016-2019},
  year = {2020},
  pages = {143--187}
}
Alghunaim SA (2020), "On the Performance and Linear Convergence of Decentralized Primal-Dual Methods". Thesis at: University of California Los Angeles.
Abstract: This dissertation studies the performance and linear convergence properties of primal-dual methods for the solution of decentralized multi-agent optimization problems. Decentralized multi-agent optimization is a powerful paradigm that finds applications in diverse fields in learning and engineering design. In these setups, a network of agents is connected through some topology and agents are allowed to share information only locally. Their overall goal is to seek the minimizer of a global optimization problem through localized interactions. In decentralized consensus problems, the agents are coupled through a common consensus variable that they need to agree upon. While in decentralized resource allocation problems, the agents are coupled through global affine constraints. Various decentralized consensus optimization algorithms already exist in the literature. Some methods are derived from a primal-dual perspective, while other methods are derived as gradient tracking mechanisms meant to track the average of local gradients. Among the gradient tracking methods are the adapt-then-combine implementations motivated by diffusion strategies, which have been observed to perform better than other implementations. In this dissertation, we develop a novel adapt-then-combine primal-dual algorithmic framework that captures most state-of-the-art gradient based methods as special cases including all the variations of the gradient-tracking methods. We also develop a concise and novel analysis technique that establishes the linear convergence of this general framework under strongly convex objectives. Due to our unified framework, the analysis reveals important characteristics for these methods such as their convergence rates and step-size stability ranges. Moreover, the analysis reveals how the augmented Lagrangian penalty term, which is utilized in most of these methods, affects the performance of decentralized algorithms. Another important question that we answer is whether decentralized proximal gradient methods can achieve global linear convergence for non-smooth composite optimization. For centralized algorithms, linear convergence has been established in the presence of a nonsmooth composite term. In this dissertation, we close the gap between centralized and decentralized proximal gradient algorithms and show that decentralized proximal algorithms can also achieve linear convergence in the presence of a non-smooth term. Furthermore, we show that when each agent possesses a different local non-smooth term then global linear convergence cannot be established in the worst case. Most works that study decentralized optimization problems assume that all agents are involved in computing all variables. However, in many applications the coupling across agents is sparse in the sense that only a few agents are involved in computing certain variables. We show how to design decentralized algorithms in sparsely coupled consensus and resource allocation problems. More importantly, we establish analytically the importance of exploiting the sparsity structure in coupled large-scale networks.
BibTeX:
@phdthesis{Alghunaim2020,
  author = {Alghunaim, Sulaiman A.},
  title = {On the Performance and Linear Convergence of Decentralized Primal-Dual Methods},
  school = {University of California Los Angeles},
  year = {2020}
}
Al-Harthi N, A.Alomairy RM, Akbudak K, Chen R, Ltaief H, Bagci H and Keyes DE (2020), "Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization", In Proceedings of ISC High Performance 2020., June, 2020.
Abstract: We design and develop a new high performance implementation of a fast direct LU-based solver using low-rank approximations on massively parallel systems. The LU factorization is the most time-consuming step in solving systems of linear equations in the context of analyzing acoustic scattering from large 3D objects. The matrix equation is obtained by discretizing the boundary integral of the exterior Helmholtz problem using a higher-order Nyström scheme. The main idea is to exploit the inherent data sparsity of the matrix operator by performing local tile-centric approximations while still capturing the most significant information. In particular, the proposed LU-based solver leverages the Tile Low-Rank (TLR) data compression format as implemented in the Hierarchical Computations on Manycore Architectures (HiCMA) library to decrease the complexity of “classical” dense direct solvers from cubic to quadratic order. We taskify the underlying boundary integral kernels to expose fine-grained computations. We then employ the dynamic runtime system StarPU to orchestrate the scheduling of computational tasks on shared and distributed-memory systems. The resulting asynchronous execution permits to compensate for the load imbalance due to the heterogeneous ranks, while mitigating the overhead of data motion. We assess the robustness of our TLR LU-based solver and study the qualitative impact when using different numerical accuracies. The new TLR LU factorization outperforms the state-of-the-art dense factorizations by up to an order of magnitude on various parallel systems, for analysis of scattering from large-scale 3D synthetic and real geometries.
BibTeX:
@inproceedings{AlHarthi2020,
  author = {Al-Harthi, Noha and A.Alomairy, Rabab M. and Akbudak, Kadir and Chen, Rui and Ltaief, Hatem and Bagci, Hakan and Keyes, David E.},
  title = {Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization},
  booktitle = {Proceedings of ISC High Performance 2020},
  year = {2020},
  url = {http://hdl.handle.net/10754/663212}
}
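
The heart of the tile low-rank (TLR) idea is that each well-separated tile is numerically low-rank and can be compressed; my own truncated-SVD illustration, whereas HiCMA works tile-by-tile on far larger problems:
% Illustrative only: compress one off-diagonal tile of a smooth kernel matrix
% with a truncated SVD and report the compression ratio and error.
nb = 256;                                   % tile size
xs = linspace(0, 1, nb)';                   % sources
xt = linspace(2, 3, nb)';                   % well-separated targets
T = 1./(abs(xt - xs') + 1);                 % a smooth, hence data-sparse, tile
[U, S, V] = svd(T);
tol = 1e-8;
r = nnz(diag(S) > tol*S(1, 1));             % numerical rank at the given accuracy
Uk = U(:, 1:r)*S(1:r, 1:r);
Vk = V(:, 1:r);
fprintf("rank %d, storage %.1f%% of dense, error %.2e\n", r, ...
    100*(numel(Uk) + numel(Vk))/numel(T), norm(T - Uk*Vk', 'fro')/norm(T, 'fro'));
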
Aliaga JI, Anzt H, Grützmacher T, Quintana-Ortí ES and Tomás AE (2020), "Compressed Basis GMRES on High Performance GPUs", September, 2020.
Abstract: Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in all current computer architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solver that advocates for decoupling the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory access, the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a lower volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating point as well as fixed point formats with little impact on the convergence of the iterative process. We develop a high performance implementation of the "compressed basis GMRES" solver in the Ginkgo sparse linear algebra library and using a large set of test problems from the SuiteSparse matrix collection we demonstrate robustness and performance advantages on a modern NVIDIA V100 GPU of up to 50% over the standard GMRES solver that stores all data in IEEE double precision.
BibTeX:
@article{Aliaga2020,
  author = {José I. Aliaga and Hartwig Anzt and Thomas Grützmacher and Enrique S. Quintana-Ortí and Andrés E. Tomás},
  title = {Compressed Basis GMRES on High Performance GPUs},
  year = {2020}
}
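
The decoupling of storage format from arithmetic precision is easy to illustrate; a toy example of mine (nothing like the Ginkgo implementation): keep an orthonormal basis in single precision and check how little orthogonality is lost while halving its memory footprint.
% Illustrative only: store an orthonormal basis in single precision and
% measure the orthogonality loss introduced by the compressed storage.
rng(0);
n = 20000; m = 50;
[V, ~] = qr(randn(n, m), 0);                 % a double-precision orthonormal basis
Vc = single(V);                              % "compressed" storage format
lossDouble = norm(V'*V - eye(m));
lossSingle = norm(double(Vc)'*double(Vc) - eye(m));
fprintf("orthogonality loss: %.2e (double storage), %.2e (single storage)\n", ...
    lossDouble, lossSingle);
fprintf("basis memory: %d bytes vs %d bytes\n", 8*numel(V), 4*numel(Vc));
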
Alimisis F, Orvieto A, Bécigneul G and Lucchi A (2020), "Practical Accelerated Optimization on Riemannian Manifolds", February, 2020.
Abstract: We develop a new Riemannian descent algorithm with an accelerated rate of convergence. We focus on functions that are geodesically convex or weakly-quasi-convex, which are weaker function classes compared to prior work that has considered geodesically strongly convex functions. Our proof of convergence relies on a novel estimate sequence which allows to demonstrate the dependency of the convergence rate on the curvature of the manifold. We validate our theoretical results empirically on several optimization problems defined on a sphere and on the manifold of positive definite matrices.
BibTeX:
@article{Alimisis2020,
  author = {Foivos Alimisis and Antonio Orvieto and Gary Bécigneul and Aurelien Lucchi},
  title = {Practical Accelerated Optimization on Riemannian Manifolds},
  year = {2020}
}
Alimo R, Beyhaghi P and Bewley TR (2020), "Delaunay-based derivative-free optimization via global surrogates. Part III: nonconvex constraints", Journal of Global Optimization., 1, 2020.
Abstract: This paper introduces a Delaunay-based derivative-free optimization algorithm, dubbed Δ-DOGS(Ω), for problems with both (a) a nonconvex, computationally expensive objective function f(x), and (b) nonlinear, computationally expensive constraint functions c_l(x) which, taken together, define a nonconvex, possibly even disconnected feasible domain Ω which is assumed to lie within a known rectangular search domain Ω_s, everywhere within which the f(x) and c_l(x) may be evaluated. Approximations of both the objective function f(x) as well as the feasible domain Ω are developed and refined as the iterations proceed. The approach is practically limited to the problems with less than about ten adjustable parameters. The work is an extension of our original Delaunay-based optimization algorithm (see JOGO DOI: 10.1007/s10898-015-0384-2), and inherits many of the constructions and strengths of that algorithm, including: (1) a surrogate function p(x) interpolating all existing function evaluations and summarizing their trends, (2) a synthetic, piecewise-quadratic uncertainty function e(x) built on the framework of a Delaunay triangulation amongst existing datapoints, (3) a tunable balance between global exploration (large K) and local refinement (small K), (4) provable global convergence for a sufficiently large K, under the assumption that the objective and constraint functions are twice differentiable with bounded Hessians, (5) an Adaptive-K variant of the algorithm that efficiently tunes K automatically based on a target value of the objective function, and (6) remarkably fast global convergence on a variety of benchmark problems.
BibTeX:
@article{Alimo2020,
  author = {Alimo, Ryan and Beyhaghi, Pooriya and Bewley, Thomas R.},
  title = {Delaunay-based derivative-free optimization via global surrogates. Part III: nonconvex constraints},
  journal = {Journal of Global Optimization},
  year = {2020},
  doi = {10.1007/s10898-019-00854-2}
}
Allman A and Zhang Q (2020), "Branch-and-Price for a Class of Nonconvex Mixed-Integer Nonlinear Programs", January, 2020.
Abstract: This work attempts to combine the strengths of two major technologies that have matured over the last three decades: global mixed-integer nonlinear optimization and branch-and-price. We consider a class of generally nonconvex mixed-integer nonlinear programs (MINLPs) with linear complicating constraints and integer linking variables. If the complicating constraints are removed, the problem becomes easy to solve, e.g. due to decomposable structure. Integrality of the linking variables allows us to apply a discretization approach to derive a Dantzig-Wolfe reformulation and solve the problem to global optimality using branch-and-price. It is a remarkably simple idea; but to our surprise, it has barely found any application in the literature. In this work, we show that many relevant problems directly fall or can be reformulated into this class of MINLPs. We present the branch-and-price algorithm and demonstrate its effectiveness (and sometimes ineffectiveness) in an extensive computational study considering multiple large-scale problems of practical relevance, showing that, in many cases, orders-of-magnitude reductions in solution time can be achieved.
BibTeX:
@article{Allman2020,
  author = {Andrew Allman and Qi Zhang},
  title = {Branch-and-Price for a Class of Nonconvex Mixed-Integer Nonlinear Programs},
  year = {2020}
}
Altschuler JM and Parrilo PA (2020), "Random Osborne: a simple, practical algorithm for Matrix Balancing in near-linear time", April, 2020.
Abstract: We revisit Matrix Balancing, a pre-conditioning task used ubiquitously for computing eigenvalues and matrix exponentials. Since 1960, Osborne's algorithm has been the practitioners' algorithm of choice, and is now implemented in most numerical software packages. However, the theoretical properties of Osborne's algorithm are not well understood. Here, we show that a simple random variant of Osborne's algorithm converges in near-linear time in the input sparsity. Specifically, it balances K ∈ ℝ_≥0^(n×n) after O(mε^(-2) log κ) arithmetic operations, where m is the number of nonzeros in K, ε is the ℓ_1 accuracy, and κ = Σ_ij K_ij/(min_{ij: K_ij≠0} K_ij) measures the conditioning of K. Previous work had established near-linear runtimes either only for ℓ_2 accuracy (a weaker criterion which is less relevant for applications), or through an entirely different algorithm based on (currently) impractical Laplacian solvers. We further show that if the graph with adjacency matrix K is moderately connected--e.g., if K has at least one positive row/column pair--then Osborne's algorithm initially converges exponentially fast, yielding an improved runtime O(mε^(-1) log κ). We also address numerical precision issues by showing that these runtime bounds still hold when using O(log(nκ/ε))-bit numbers. Our results are established through a potential argument that leverages a convex optimization perspective of Osborne's algorithm, and relates the per-iteration progress to the current imbalance as measured in Hellinger distance. Unlike previous analyses, we critically exploit log-convexity of the potential. Our analysis extends to other variants of Osborne's algorithm: along the way, we establish significantly improved runtime bounds for cyclic, greedy, and parallelized variants.
BibTeX:
@article{Altschuler2020,
  author = {Jason M. Altschuler and Pablo A. Parrilo},
  title = {Random Osborne: a simple, practical algorithm for Matrix Balancing in near-linear time},
  year = {2020}
}
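
The classical cyclic Osborne iteration that this paper randomizes takes only a few lines; my own sketch:
% Illustrative only: cyclic Osborne iteration for matrix balancing.
% Each sweep rescales D*A*inv(D) so that, for every index i, the norms of
% row i and column i (excluding the diagonal) match.
rng(0);
n = 50;
A = abs(sprandn(n, n, 0.2)) + speye(n);   % nonnegative matrix with nonzero diagonal
B = full(A);
for sweep = 1:20
    for i = 1:n
        rowNorm = norm([B(i, 1:i-1), B(i, i+1:end)]);
        colNorm = norm([B(1:i-1, i); B(i+1:end, i)]);
        if rowNorm > 0 && colNorm > 0
            alpha = sqrt(rowNorm/colNorm);
            B(:, i) = alpha*B(:, i);      % scale column i up ...
            B(i, :) = B(i, :)/alpha;      % ... and row i down by the same factor
        end
    end
end
offDiag = B - diag(diag(B));
fprintf("Max row/column norm mismatch after balancing: %.2e\n", ...
    max(abs(vecnorm(offDiag, 2, 2) - vecnorm(offDiag, 2, 1)')));
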
Altschuler JM and Parrilo PA (2020), "Approximating Min-Mean-Cycle for low-diameter graphs in near-optimal time and memory", April, 2020.
Abstract: We revisit Min-Mean-Cycle, the classical problem of finding a cycle in a weighted directed graph with minimum mean weight. Despite an extensive algorithmic literature, previous work falls short of a near-linear runtime in the number of edges m--in fact, there is a natural barrier which precludes such a runtime for solving Min-Mean-Cycle exactly. Here, we give a much faster approximation algorithm that, for graphs with polylogarithmic diameter, has near-linear runtime. In particular, this is the first algorithm whose runtime for the complete graph scales in the number of vertices n as O(n^2). Moreover--unconditionally on the diameter--the algorithm uses only O(n) memory beyond reading the input, making it "memory-optimal". The algorithm is also simple to implement and has remarkable practical performance. Our approach is based on solving a linear programming (LP) relaxation using entropic regularization, which effectively reduces the LP to a Matrix Balancing problem--a la the popular reduction of Optimal Transport to Matrix Scaling. We then round the fractional LP solution using a variant of the classical Cycle-Cancelling algorithm that is sped up to near-linear runtime at the expense of being approximate, and implemented in a memory-optimal manner. We also provide an alternative algorithm with slightly faster theoretical runtime, albeit worse memory usage and practicality. This algorithm uses the same rounding procedure, but solves the LP relaxation by leveraging recent developments in area-convexity regularization. Its runtime scales inversely in the approximation accuracy, which we show is optimal--barring a major breakthrough in algorithmic graph theory, namely faster Shortest Paths algorithms.
BibTeX:
@article{Altschuler2020a,
  author = {Jason M. Altschuler and Pablo A. Parrilo},
  title = {Approximating Min-Mean-Cycle for low-diameter graphs in near-optimal time and memory},
  year = {2020}
}
Alyahya H, Mehmood R and Katib I (2020), "Parallel Iterative Solution of Large Sparse Linear Equation Systems on the Intel MIC Architecture", In Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies. Cham , pp. 377-407. Springer International Publishing.
Abstract: Many important scientific, engineering, and smart city applications require solving large sparse linear equation systems. The numerical methods for solving linear equations can be categorised into direct methods and iterative methods. Jacobi method is one of the iterative solvers that has been widely used due to its simplicity and efficiency. Its performance is affected by factors including the storage format, the specific computational algorithm, and its implementation. While the performance of Jacobi has been studied extensively on conventional CPU architectures, research on its performance on emerging architectures, such as the Intel Many Integrated Core (MIC) architecture, is still in its infancy. In this chapter, we investigate the performance of parallel implementations of the Jacobi method on Knights Corner (KNC), the first generation of the Intel MIC architectures. We implement Jacobi with two storage formats, Compressed Sparse Row (CSR) and Modified Sparse Row (MSR), and measure their performance in terms of execution time, offloading time, and speedup. We report results of sparse matrices with over 28 million rows and 640 million non-zero elements acquired from 13 diverse application domains. The experimental results show that our Jacobi parallel implementation on MIC achieves speedups of up to 27.75× compared to the sequential implementation. It also delivers a speedup of up to 3.81× compared to a powerful node comprising 24 cores in two Intel Xeon E5-2695v2 processors.
BibTeX:
@inbook{Alyahya2020,
  author = {Alyahya, Hana and Mehmood, Rashid and Katib, Iyad},
  editor = {Mehmood, Rashid and See, Simon and Katib, Iyad and Chlamtac, Imrich},
  title = {Parallel Iterative Solution of Large Sparse Linear Equation Systems on the Intel MIC Architecture},
  booktitle = {Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {377--407},
  doi = {10.1007/978-3-030-13705-2_16}
}
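The kernel being parallelised above is short enough to state in full. A dense-storage MATLAB sketch of the Jacobi iteration follows; the chapter's contribution lies in the CSR/MSR storage and the MIC offloading, not in this textbook loop.
function x = jacobi(A, b, tol, maxit)
% Jacobi iteration for A*x = b: x_{k+1} = D^{-1} * (b - (A - D) * x_k).
% Assumes a nonzero diagonal; converges, e.g., for strictly diagonally dominant A.
d = diag(A);
R = A - diag(d);                        % off-diagonal part of A
x = zeros(size(b));
for k = 1:maxit
    xnew = (b - R * x) ./ d;
    if norm(xnew - x, inf) <= tol * norm(b, inf)
        x = xnew;
        return
    end
    x = xnew;
end
end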
Amaral VS, Andreani R, Birgin EG, Marcondes DS and Martínez JM (2020), "On complexity and convergence of high-order coordinate descent algorithms", September, 2020.
Abstract: Coordinate descent methods with high-order regularized models for box-constrained minimization are introduced. High-order stationarity asymptotic convergence and first-order stationarity worst-case evaluation complexity bounds are established. The computer work that is necessary for obtaining first-order ε-stationarity with respect to the variables of each coordinate-descent block is O(ε^(-(p+1)/p)), whereas the computer work for getting first-order ε-stationarity with respect to all the variables simultaneously is O(ε^(-(p+1))). Numerical examples involving multidimensional scaling problems are presented. The numerical performance of the methods is enhanced by means of coordinate-descent strategies for choosing initial points.
BibTeX:
@article{Amaral2020,
  author = {V. S. Amaral and R. Andreani and E. G. Birgin and D. S. Marcondes and J. M. Martínez},
  title = {On complexity and convergence of high-order coordinate descent algorithms},
  year = {2020}
}
Anaissi A, Suleiman B and Zandavi SM (2020), "NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent", March, 2020.
Abstract: Multi-way data analysis has become an essential tool for capturing underlying structures in higher-order datasets stored in a tensor X ∊ ℝ^(I_1 × ⋯ × I_N). CANDECOMP/PARAFAC (CP) decomposition has been extensively studied and applied to approximate X by N loading matrices A^(1), …, A^(N), where N represents the order of the tensor. We propose a new efficient CP decomposition solver named NeCPD for the non-convex problem in multi-way online data, based on the stochastic gradient descent (SGD) algorithm. SGD is very useful in the online setting since it allows us to update X^(t+1) in a single step. In terms of global convergence, it is well known that SGD can get stuck at saddle points when it deals with non-convex problems. We study the Hessian matrix to identify these saddle points, and then try to escape them using a perturbation approach which adds a little noise to the gradient update step. We further apply Nesterov's Accelerated Gradient (NAG) method in the SGD algorithm to optimally accelerate the convergence rate and compensate for the Hessian computation delay per epoch. Experimental evaluation in the field of structural health monitoring using laboratory-based and real-life structural datasets shows that our method provides more accurate results compared with existing online tensor analysis methods.
BibTeX:
@article{Anaissi2020,
  author = {Ali Anaissi and Basem Suleiman and Seid Miad Zandavi},
  title = {NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent},
  year = {2020}
}
Andrei N (2020), "Nonlinear Conjugate Gradient Methods for Unconstrained Optimization" Springer International Publishing.
Abstract: Two approaches are known for solving large-scale unconstrained optimization problems—the limited-memory quasi-Newton method (truncated Newton method) and the conjugate gradient method. This is the first book to detail conjugate gradient methods, showing their properties and convergence characteristics as well as their performance in solving large-scale unconstrained optimization problems and applications. Comparisons to the limited-memory and truncated Newton methods are also discussed. Topics studied in detail include: linear conjugate gradient methods, standard conjugate gradient methods, acceleration of conjugate gradient methods, hybrid, modifications of the standard scheme, memoryless BFGS preconditioned, and three-term. Other conjugate gradient methods with clustering the eigenvalues or with the minimization of the condition number of the iteration matrix, are also treated. For each method, the convergence analysis, the computational performances and the comparisons versus other conjugate gradient methods are given. \ The theory behind the conjugate gradient algorithms presented as a methodology is developed with a clear, rigorous, and friendly exposition; the reader will gain an understanding of their properties and their convergence and will learn to develop and prove the convergence of his/her own methods. Numerous numerical studies are supplied with comparisons and comments on the behavior of conjugate gradient algorithms for solving a collection of 800 unconstrained optimization problems of different structures and complexities with the number of variables in the range [1000,10000]. The book is addressed to all those interested in developing and using new advanced techniques for solving unconstrained optimization complex problems. Mathematical programming researchers, theoreticians and practitioners in operations research, practitioners in engineering and industry researchers, as well as graduate students in mathematics, Ph.D. and master students in mathematical programming, will find plenty of information and practical applications for solving large-scale unconstrained optimization problems and applications by conjugate gradient methods.
BibTeX:
@book{Andrei2020,
  author = {Neculai Andrei},
  title = {Nonlinear Conjugate Gradient Methods for Unconstrained Optimization},
  publisher = {Springer International Publishing},
  year = {2020},
  doi = {10.1007/978-3-030-42950-8}
}
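As a reminder of the basic scheme the book is devoted to, here is a minimal Fletcher-Reeves iteration with an Armijo backtracking line search; tolerances and the restart-free update are illustrative choices of mine, not the book's recommended variants.
function x = nonlinear_cg(fun, grad, x, maxit)
% Fletcher-Reeves nonlinear conjugate gradient with Armijo backtracking.
% fun and grad are handles returning f(x) and its gradient; x is the starting point.
g = grad(x);
d = -g;
for k = 1:maxit
    t = 1;                                        % backtracking line search
    while fun(x + t * d) > fun(x) + 1e-4 * t * (g' * d) && t > 1e-12
        t = t / 2;
    end
    x = x + t * d;
    gnew = grad(x);
    if norm(gnew) < 1e-8
        break
    end
    beta = (gnew' * gnew) / (g' * g);             % Fletcher-Reeves coefficient
    d = -gnew + beta * d;
    g = gnew;
end
end
On a strictly convex quadratic, fun = @(x) 0.5*x'*A*x - b'*x with grad = @(x) A*x - b, the iteration reduces to an (inexact) linear conjugate gradient method.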
Angerd A (2020), "Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units". Thesis at: Division of Computer Engineering, Department of Computer Science & Engineering, Chalmers University of Technology.
Abstract: A key challenge in modern computing systems is to access data fast enough to fully utilize the computing elements in the chip. In Graphics Processing Units (GPUs), the performance is often constrained by register file size, memory bandwidth, and the capacity of the main memory. One important technique towards alleviating this challenge is data compression. By reducing the amount of data that needs to be communicated or stored, memory resources crucial for performance can be efficiently utilized.\ This thesis provides a set of approximation and compression techniques for GPUs, with the goal of efficiently utilizing the computational fabric, and thereby increasing performance. The thesis shows that these techniques can substantially lower the amount of information the system has to process, and are thus important tools in the process of meeting challenges in memory utilization. This thesis makes contributions within three areas: controlled floating-point precision reduction, lossless and lossy memory compression, and distributed training of neural networks. In the first area, the thesis shows that through automated and controlled floating-point approximation, the register file can be more efficiently utilized. This is achieved through a framework which establishes a cross-layer connection between the application and the microarchitecture layer, and a novel register file organization capable of leveraging low-precision floating-point values and narrow integers for increased capacity and performance.\ Within the area of compression, this thesis aims at increasing the effective bandwidth of GPUs by presenting a lossless and lossy memory compression algorithm to reduce the amount of transferred data. In contrast to state-of-the-art compression techniques such as Base-Delta-Immediate and Bitplane Compression, which use intra-block bases for compression, the proposed algorithm leverages multiple global base values to reach a higher compression ratio. The algorithm includes an optional approximation step for floating-point values which offers higher compression ratio at a given, low, error rate.\ Finally, within the area of distributed training of neural networks, this thesis proposes a subgraph approximation scheme for graph data which mitigates accuracy loss in a distributed setting. The scheme allows neural network models that use graphs as inputs to converge at single-machine accuracy, while minimizing synchronization overhead between the machines.
BibTeX:
@phdthesis{Angerd2020,
  author = {Alexandra Angerd},
  title = {Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units},
  school = {Division of Computer Engineering, Department of Computer Science & Engineering, Chalmers University of Technology},
  year = {2020},
  url = {https://research.chalmers.se/publication/521610/file/521610_Fulltext.pdf}
}
Angriman E, Predari M, van der Grinten A and Meyerhenke H (2020), "Approximation of the Diagonal of a Laplacian's Pseudoinverse for Complex Network Analysis", June, 2020.
Abstract: The ubiquity of massive graph data sets in numerous applications requires fast algorithms for extracting knowledge from these data. We are motivated here by three electrical measures for the analysis of large small-world graphs G = (V, E) -- i.e., graphs with diameter in O(log |V|), which are abundant in complex network analysis. From a computational point of view, the three measures have in common that their crucial component is the diagonal of the graph Laplacian's pseudoinverse, L^†. Computing diag(L^†) exactly by pseudoinversion, however, is as expensive as dense matrix multiplication -- and the standard tools in practice even require cubic time. Moreover, the pseudoinverse requires quadratic space -- hardly feasible for large graphs. Resorting to approximation by, e.g., using the Johnson-Lindenstrauss transform, requires the solution of O(log |V| / 𝜖^2) Laplacian linear systems to guarantee a relative error, which is still very expensive for large inputs. In this paper, we present a novel approximation algorithm that requires the solution of only one Laplacian linear system. The remaining parts are purely combinatorial -- mainly sampling uniform spanning trees, which we relate to diag(L^†) via effective resistances. For small-world networks, our algorithm obtains a ± 𝜖-approximation with high probability, in a time that is nearly-linear in |E| and quadratic in 1 / 𝜖. Another positive aspect of our algorithm is its parallel nature due to independent sampling. We thus provide two parallel implementations of our algorithm: one using OpenMP, one MPI + OpenMP. In our experiments against the state of the art, our algorithm (i) yields more accurate results, (ii) is much faster and more memory-efficient, and (iii) obtains good parallel speedups, in particular in the distributed setting.
BibTeX:
@article{Angriman2020,
  author = {Eugenio Angriman and Maria Predari and Alexander van der Grinten and Henning Meyerhenke},
  title = {Approximation of the Diagonal of a Laplacian's Pseudoinverse for Complex Network Analysis},
  year = {2020}
}
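For a small graph, the quantity the authors approximate can be computed exactly by pseudoinversion; their point is that this brute-force route costs cubic time and quadratic memory. A toy MATLAB check of the exact quantity.
A = [0 1 1 0; 1 0 1 1; 1 1 0 1; 0 1 1 0];   % adjacency matrix of a small graph
L = diag(sum(A, 2)) - A;                     % graph Laplacian
Lp = pinv(full(L));                          % dense pseudoinverse: the expensive step
dLp = diag(Lp);                              % diag(L^+), the target of the approximation
% Effective resistance between u and v: Lp(u,u) + Lp(v,v) - 2*Lp(u,v).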
Anikin A, Dorn Y and Nesterov Y (2020), "Computational Methods for the Stable Dynamic Model", In Communications in Computer and Information Science. , pp. 280-294. Springer International Publishing.
Abstract: The traffic assignment problem is one of the central problems in transportation science. Various model assumptions lead to different setups corresponding to nonlinear optimization problems.\ In this work, we focus on the stable dynamic model and its generalizations. We propose a new equivalent representation for the stable dynamic model [Nesterov and de Palma, 2003]. We use a smoothing technique to derive a new model, which can be interpreted as a stochastic equilibrium model.
BibTeX:
@incollection{Anikin2020,
  author = {Anton Anikin and Yuriy Dorn and Yurii Nesterov},
  title = {Computational Methods for the Stable Dynamic Model},
  booktitle = {Communications in Computer and Information Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {280--294},
  doi = {10.1007/978-3-030-38603-0_21}
}
Antonakopoulos K, Belmega EV and Mertikopoulos P (2020), "Online and Stochastic Optimization beyond Lipschitz Continuity: A Riemannian Approach", In Proceedings of the 8th International Conference on Learning Representations.
Abstract: Motivated by applications to machine learning and imaging science, we study a class of online and stochastic optimization problems with loss functions that are not Lipschitz continuous; in particular, the loss functions encountered by the optimizer could exhibit gradient singularities or be singular themselves. Drawing on tools and techniques from Riemannian geometry, we examine a Riemann–Lipschitz (RL) continuity condition which is tailored to the singularity landscape of the problem's loss functions. In this way, we are able to tackle cases beyond the Lipschitz framework provided by a global norm, and we derive optimal regret bounds and last iterate convergence results through the use of regularized learning methods (such as online mirror descent). These results are subsequently validated in a class of stochastic Poisson inverse problems that arise in imaging science.
BibTeX:
@inproceedings{Antonakopoulos2020,
  author = {Kimon Antonakopoulos and E. Veronica Belmega and Panayotis Mertikopoulos},
  title = {Online and Stochastic Optimization beyond Lipschitz Continuity: A Riemannian Approach},
  booktitle = {Proceedings of the 8th International Conference on Learning Representations},
  year = {2020}
}
Anwer AR, Li G, Pattabiraman K, Sullivan M, Tsai T and Hari SKS (2020), "GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs", Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis.
Abstract: Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures than CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications.\ In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is 2 orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.
BibTeX:
@article{Anwer2020,
  author = {Abdul Rehman Anwer and Guanpeng Li and Karthik Pattabiraman and Michael Sullivan and Timothy Tsai and Siva Kumar Sastry Hari},
  title = {GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs},
  journal = {Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis},
  year = {2020},
  url = {http://blogs.ubc.ca/karthik/files/2020/08/SC2020-final.pdf}
}
Anzt H, Boman E, Falgout R, Ghysels P, Heroux M, Li X, McInnes LC, Mills RT, Rajamanickam S, Rupp K, Smith B, Yamazaki I and Yang UM (2020), "Preparing sparse solvers for exascale computing", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190053. The Royal Society.
Abstract: Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges.
BibTeX:
@article{Anzt2020,
  author = {Hartwig Anzt and Erik Boman and Rob Falgout and Pieter Ghysels and Michael Heroux and Xiaoye Li and Lois Curfman McInnes and Richard Tran Mills and Sivasankaran Rajamanickam and Karl Rupp and Barry Smith and Ichitaro Yamazaki and Ulrike Meier Yang},
  title = {Preparing sparse solvers for exascale computing},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190053},
  doi = {10.1098/rsta.2019.0053}
}
Anzt H, Cojean T, Yen-Chen C, Dongarra J, Flegar G, Nayak P, Tomov S, Tsai YM and Wang W (2020), "Load-balancing Sparse Matrix Vector Product Kernels on GPUs", ACM Transactions on Parallel Computing., 3, 2020. Vol. 7(1), pp. 1-26. Association for Computing Machinery (ACM).
Abstract: Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA's cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
BibTeX:
@article{Anzt2020a,
  author = {Hartwig Anzt and Terry Cojean and Chen Yen-Chen and Jack Dongarra and Goran Flegar and Pratik Nayak and Stanimire Tomov and Yuhsiang M. Tsai and Weichung Wang},
  title = {Load-balancing Sparse Matrix Vector Product Kernels on GPUs},
  journal = {ACM Transactions on Parallel Computing},
  publisher = {Association for Computing Machinery (ACM)},
  year = {2020},
  volume = {7},
  number = {1},
  pages = {1--26},
  doi = {10.1145/3380930}
}
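To make the load-balancing issue concrete, here is the scalar CSR kernel in plain MATLAB; on a GPU each outer iteration maps to a thread, so wall time is dictated by the longest row, which is exactly the imbalance the paper's COO/ELL hybrid targets. The helper below is my own illustration, not code from the paper.
function y = spmv_csr(rowptr, colidx, vals, x)
% y = A*x with A stored in CSR: 1-based rowptr (length n+1), per-row column indices and values.
n = numel(rowptr) - 1;
y = zeros(n, 1);
for i = 1:n                                     % one "thread" per row on a GPU
    for k = rowptr(i):(rowptr(i + 1) - 1)       % row length drives the load imbalance
        y(i) = y(i) + vals(k) * x(colidx(k));
    end
end
end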
Anzt H, Cojean T, Flegar G, Göbel F, Grützmacher T, Nayak P, Ribizel T, Tsai YM and Quintana-Ortí ES (2020), "Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing", June, 2020.
Abstract: In this paper, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo's design principle abstracts all functionality as "linear operators", motivating the notation of a "linear operator algebra library". Ginkgo's current focus is oriented towards providing sparse linear algebra functionality for high performance GPU architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific back ends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo's usability by providing examples on how to use its functionality inside the MFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo's high performance on state-of-the-art GPU architectures.
BibTeX:
@article{Anzt2020b,
  author = {Hartwig Anzt and Terry Cojean and Goran Flegar and Fritz Göbel and Thomas Grützmacher and Pratik Nayak and Tobias Ribizel and Yuhsiang Mike Tsai and Enrique S. Quintana-Ortí},
  title = {Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing},
  year = {2020}
}
Anzt H, Cojean T, Chen Y-C, Flegar G, Göbel F, Grützmacher T, Nayak P, Ribizel T and Tsai Y-H (2020), "Ginkgo: A high performance numerical linear algebra library", The Journal of Open Source Software.
Abstract: Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures with a high level of performance portability and focuses on software sustainability.\ The library focuses on solving sparse linear systems and accommodates a large variety of matrix formats, state-of-the-art iterative (Krylov) solvers and preconditioners, which make the library suitable for a variety of scientific applications. Ginkgo supports many architectures such as multi-threaded CPU, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead.\ Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence there has been a significant amount of effort in this direction with software libraries such as UMFPACK and CHOLMOD (“Suitesparse,” 2020) for solving linear systems with direct methods and PETSc (“PETSc,” 2020), Trilinos (“Trilinos,” 2020), Eigen (“Eigen,” 2020) and many more to solve linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide very efficient low level kernels optimized for different architectures and separate these kernels from the algorithms thereby ensuring extensibility and ease of use.\ Ginkgo is also a part of the xSDK effort (“xSDK,” 2020) and available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for and interoperability between a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using High Performance Computing. Within this effort, we provide interoperability with application libraries such as deal.ii (Arndt et al., 2019) and mfem (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of the features of Ginkgo
BibTeX:
@article{Anzt2020c,
  author = {Hartwig Anzt and Terry Cojean and Yen-Chen Chen and Goran Flegar and Fritz Göbel and Thomas Grützmacher and Pratik Nayak and Tobias Ribizel and Yu-Hsiang Tsai},
  title = {Ginkgo: A high performance numerical linear algebra library},
  journal = {The Journal of Open Source Software},
  year = {2020},
  doi = {10.21105/joss.02260}
}
Anzt H, Tsai YM, Abdelfattah A, Cojean T and Dongarra J (2020), "Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse and Batched Computations", In Proceedings of the 2020 IEEE/ACM Conference on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems.
Abstract: GPU accelerators have become an important backbone for scientific high-performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA's newest server-line GPU, the A100 architecture, part of the Ampere generation. Specifically, we assess its performance for sparse and batched computations, as these routines are relied upon in many scientific applications, and compare to the performance achieved on NVIDIA's previous server-line GPU.
BibTeX:
@inproceedings{Anzt2020d,
  author = {Hartwig Anzt and Yuhsiang M. Tsai and Ahmad Abdelfattah and Terry Cojean and Jack Dongarra},
  title = {Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse and Batched Computations},
  booktitle = {Proceedings of the 2020 IEEE/ACM Conference on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems},
  year = {2020}
}
Anzt H, Kuehn E and Flegar G (2020), "Crediting Pull Requests to Open Source Research Software as an Academic Contribution", Journal of Computational Science. , pp. 101278.
Abstract: Like any other scientific discipline, the High Performance Computing community suffers under the publish or perish paradigm. As a result, a significant portion of novel algorithm designs and hardware-optimized implementations never make it into production code but are instead abandoned once they served the purpose of yielding (another) publication. At the same time, community software packages driving scientific research lack the addition of new technology and hardware-specific implementations. This results in a very unsatisfying situation where researchers and software developers are working independently, and the traditional peer reviewing is reaching its capacity limits. A paradigm shift that accepts high-quality software pull requests to open source research software as conference contributions may create incentives to realize new and/or improved algorithms in community software ecosystems. In this paper, we propose to complement code reviews on pull requests to scientific open source software with scientific reviews, and allow the presentation and publication of high quality software contributions that present an academic improvement to the state-of-the-art at scientific conferences.
BibTeX:
@article{Anzt2020e,
  author = {Hartwig Anzt and Eileen Kuehn and Goran Flegar},
  title = {Crediting Pull Requests to Open Source Research Software as an Academic Contribution},
  journal = {Journal of Computational Science},
  year = {2020},
  pages = {101278},
  url = {http://www.sciencedirect.com/science/article/pii/S1877750320305743},
  doi = {10.1016/j.jocs.2020.101278}
}
Archibald R, Chow E, D'Azevedo E, Dongarra J, Eisenbach M, Febbo R, Lopez F, Nichols D, Tomov S, Wong K and Yin J (2020), "Integrating Deep Learning in Domain Sciences at Exascale". Thesis at: University of Tennessee.
Abstract: This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
BibTeX:
@techreport{Archibald2020,
  author = {Rick Archibald and Edmond Chow and Eduardo D'Azevedo and Jack Dongarra and Markus Eisenbach and Rocco Febbo and Florent Lopez and Daniel Nichols and Stanimire Tomov and Kwai Wong and Junqi Yin},
  title = {Integrating Deep Learning in Domain Sciences at Exascale},
  school = {University of Tennessee},
  year = {2020},
  url = {https://www.icl.utk.edu/files/publications/2020/icl-utk-1403-2020.pdf}
}
Asgari B, Hadidi R, Dierberger J, Steinichen C and Kim H (2020), "Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads", November, 2020.
Abstract: Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in performing faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only is this challenge not resolved, but also it becomes more serious with the advent of domain-specific architectures (DSAs), as they intend to more aggressively improve performance. The performance implications of using various formats along with DSAs, however, has not been extensively studied by prior work. To fill this gap of knowledge, we characterize the impact of using seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV), implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.
BibTeX:
@article{Asgari2020,
  author = {Bahar Asgari and Ramyad Hadidi and Joshua Dierberger and Charlotte Steinichen and Hyesoon Kim},
  title = {Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads},
  year = {2020}
}
Ashcraft C, Buttari A and Mary T (2020), "Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR^2 Format". Thesis at: Institut de recherche en informatique de Toulouse (IRIT).
Abstract: We investigate a special class of data sparse rank-structured matrices that combine a flat block low-rank (BLR) partitioning with the use of shared (called nested in the hierarchical case) bases. This format is to ℋ^2 matrices what BLR is to ℋ matrices: we therefore call it the BLR^2 matrix format. We present algorithms for the construction and LU factorization of BLR^2 matrices, and perform their cost analysis—both asymptotically and for a fixed problem size. With weak admissibility, BLR^2 matrices reduce to block separable matrices (the flat version of HBS/HSS). Our analysis and numerical experiments reveal some limitations of BLR^2 matrices with weak admissibility, which we propose to overcome with two approaches: strong admissibility, and the use of multiple shared bases per row and column.
BibTeX:
@techreport{Ashcraft2020,
  author = {Ashcraft, Cleve and Buttari, Alfredo and Mary, Théo},
  title = {Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR^2 Format},
  school = {Institut de recherche en informatique de Toulouse (IRIT)},
  year = {2020},
  url = {https://hal.archives-ouvertes.fr/hal-03070416}
}
Attouch H, Chbani Z and Riahi H (2020), "Fast Convex Optimization Via a Third-Order In Time Evolution Equation"
Abstract: In a Hilbert space ℋ, we develop fast convex optimization methods, which are based on a third order in time evolution system. The function to minimize f : ℋ → ℝ is convex, continuously differentiable, with argmin f ≠ ∅, and enters the dynamic via its gradient. On the basis of Lyapunov's analysis and temporal scaling techniques, we show a convergence rate of the values of the order 1/t^3, and obtain the convergence of the trajectories towards optimal solutions. When f is strongly convex, an exponential rate of convergence is obtained. We complete the study of the continuous dynamic by introducing a damping term induced by the Hessian of f. This allows the oscillations to be controlled and attenuated. Then, we analyze the convergence of the proximal-based algorithms obtained by temporal discretization of this system, and obtain similar convergence rates. The algorithmic results are valid for a general convex, lower semicontinuous, and proper function f : ℋ → ℝ ∪ {+∞}.
BibTeX:
@article{Attouch2020,
  author = {Hedy Attouch and Zaki Chbani and Hassan Riahi},
  title = {Fast Convex Optimization Via a Third-Order In Time Evolution Equation},
  year = {2020}
}
Awan MG, Deslippe J, Buluc A, Selvitopi O, Hofmeyr S, Oliker L and Yelick K (2020), "ADEPT: a domain independent sequence alignment strategy for gpu architectures", BMC Bioinformatics., 9, 2020. Vol. 21(1) Springer Science and Business Media LLC.
Abstract: Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic programming based method. With the advent of modern sequencing technologies and increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator based architectures, a need for an efficient GPU accelerated strategy has emerged. Existing GPU based strategies have either been optimized for a specific type of characters (Nucleotides or Amino Acids) or for only a handful of application use-cases.
BibTeX:
@article{Awan2020,
  author = {Muaaz G. Awan and Jack Deslippe and Aydin Buluc and Oguz Selvitopi and Steven Hofmeyr and Leonid Oliker and Katherine Yelick},
  title = {ADEPT: a domain independent sequence alignment strategy for gpu architectures},
  journal = {BMC Bioinformatics},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {21},
  number = {1},
  doi = {10.1186/s12859-020-03720-1}
}
Awwal AM, Kumam P and Mohammad H (2020), "Iterative algorithm with structured diagonal Hessian approximation for solving nonlinear least squares problems", February, 2020.
Abstract: Nonlinear least-squares problems are a special class of unconstrained optimization problems whose gradient and Hessian have special structures. In this paper, we exploit these structures and propose a matrix-free algorithm with a diagonal Hessian approximation for solving nonlinear least-squares problems. We devise appropriate safeguarding strategies to ensure the Hessian matrix is positive definite throughout the iteration process. The proposed algorithm generates descent directions and is globally convergent. Preliminary numerical experiments show that the proposed method is competitive with a recently developed similar method.
BibTeX:
@article{Awwal2020,
  author = {Aliyu Muhammed Awwal and Poom Kumam and Hassan Mohammad},
  title = {Iterative algorithm with structured diagonal Hessian approximation for solving nonlinear least squares problems},
  year = {2020}
}
Ayala A, Tomov S, Haidar A and Dongarra J (2020), "heFFTe: Highly Efficient FFT for Exascale", In Lecture Notes in Computer Science. , pp. 262-275. Springer International Publishing.
Abstract: Exascale computing aspires to meet the increasing demands from large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; henceforth, it is not only important to develop well-designed software, but also make it aware of the hardware architecture and efficiently exploit its power. Currently, several and diverse applications, such as those part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of the heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from CPU state-of-the-art libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V-100 GPUs.
BibTeX:
@incollection{Ayala2020,
  author = {Alan Ayala and Stanimire Tomov and Azzam Haidar and Jack Dongarra},
  title = {heFFTe: Highly Efficient FFT for Exascale},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {262--275},
  doi = {10.1007/978-3-030-50371-0_19}
}
Azad A, Buluç A, Li XS, Wang X and Langguth J (2020), "A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs", SIAM Journal on Scientific Computing., 1, 2020. Vol. 42(4), pp. C143-C168. Society for Industrial & Applied Mathematics (SIAM).
Abstract: We design and implement an efficient parallel algorithm for finding a perfect matching in a weighted bipartite graph such that weights on the edges of the matching are large. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization. Due to the lack of scalable alternatives, distributed solvers use sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64. To overcome this limitation, we propose a fully parallel distributed memory algorithm that first generates a perfect matching and then iteratively improves the weight of the perfect matching by searching for weight-increasing cycles of length 4 in parallel. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.
BibTeX:
@article{Azad2020,
  author = {Ariful Azad and Aydin Buluç and Xiaoye S. Li and Xinliang Wang and Johannes Langguth},
  title = {A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs},
  journal = {SIAM Journal on Scientific Computing},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {42},
  number = {4},
  pages = {C143--C168},
  doi = {10.1137/18m1189348}
}
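For a sense of the sequential baseline the authors replace (MC64-style weighted matchings used for pivot selection), MATLAB's matchpairs solves the corresponding assignment problem on a small dense instance; maximising weight amounts to minimising the negated weights. A rough sketch, assuming a fully populated weight matrix W.
n = 6;
W = rand(n);                                 % bipartite edge weights
M = matchpairs(-W, 1e6);                     % minimise -W; a large unmatched cost forces a perfect matching
totalWeight = sum(W(sub2ind([n n], M(:, 1), M(:, 2))));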
Azad A, Aznaveh MM, Beamer S, Blanco M, Chen J, D'Alessandro L, Dathathri R, Davis T, Deweese K, Firoz J, Gabb HA, Gill G, Hegyi B, Kolodzie S, Low TM, Lumsdaine A, Manlaibaatar T, Mattson TG, McMillan S, Peri R, Pingali K, Sridhar U, Szarnyas G, Zhang Y and Zhang Y (2020), "Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite", In Proceedings of the 2020 IEEE International Symposium on Workload Characterization.
Abstract: Graphs play a key role in data analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways and which graph solution works best on a given platform changes with the structure of the graph. This makes it difficult to decide which graph programming framework is the best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five different frameworks for graph analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection (GKC), and GraphIt. We use the GAP Benchmark Suite to evaluate each framework. GAP consists of 30 tests: six graph algorithms (breadth-first search, single-source shortest path, PageRank, betweenness centrality, connected components, and triangle counting) on five graphs. The GAP Benchmark Suite includes high-performance reference implementations to provide a performance baseline for comparison. Our results show the relative strengths of each framework, but also serve as a case study for the challenges of establishing objective measures for comparing graph frameworks.
BibTeX:
@inproceedings{Azad2020a,
  author = {Ariful Azad and Mohsen Mahmoudi Aznaveh and Scott Beamer and Mark Blanco and Jinhao Chen and Luke D'Alessandro and Roshan Dathathri and Tim Davis and Kevin Deweese and Jesun Firoz and Henry A Gabb and Gurbinder Gill and Balint Hegyi and Scott Kolodzie and Tze Meng Low and Andrew Lumsdaine and Tugsbayasgalan Manlaibaatar and Timothy G Mattson and Scott McMillan and Ramesh Peri and Keshav Pingali and Upasana Sridhar and Gabor Szarnyas and Yunming Zhang and Yongzhe Zhang},
  title = {Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite},
  booktitle = {Proceedings of the 2020 IEEE International Symposium on Workload Characterization},
  year = {2020},
  url = {https://www.cs.utexas.edu/ roshan/GraphAnalyticsFrameworksStudy.pdf}
}
Baayen J and Marecek J (2020), "Mixed-Integer Path-Stable Optimisation, with Applications in Model-Predictive Control of Water Systems", January, 2020.
Abstract: Many systems exhibit a mixture of continuous and discrete dynamics. We consider a family of mixed-integer non-convex non-linear optimisation problems obtained in discretisations of optimal control of such systems. For this family, a branch-and-bound algorithm solves the discretised problem to global optimality. As an example, we consider water systems, where variations in flow and variations in water levels are continuous, while decisions related to fixed-speed pumps and to the opening and closing of gates are discrete. We show that the related optimal-control problems come from the family we introduce -- and implement deterministic solvers with global convergence guarantees.
BibTeX:
@article{Baayen2020,
  author = {Jorn Baayen and Jakub Marecek},
  title = {Mixed-Integer Path-Stable Optimisation, with Applications in Model-Predictive Control of Water Systems},
  year = {2020}
}
Barik R, Minutoli M, Halappanavar M, Tallent NR and Kalyanaraman A (2020), "Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation", In Proceedings of the International Symposium on Workload Characterization., 10, 2020. IEEE.
Abstract: Vertex reordering is a way to improve locality in graph computations. Given an input (or “natural”) order, reordering aims to compute an alternate permutation of the vertices that is aimed at maximizing a locality-based objective. Given decades of research on this topic, there are tens of graph reordering schemes, and there are also several linear arrangement “gap” measures for treatment as objectives. However, a comprehensive empirical analysis of the efficacy of the ordering schemes against the different gap measures, and against real-world applications is currently lacking. In this study, we present an extensive empirical evaluation of up to 11 ordering schemes, taken from different classes of approaches, on a set of 34 real-world graphs emerging from different application domains. Our study is presented in two parts: a) a thorough comparative evaluation of the different ordering schemes on their effectiveness to optimize different linear arrangement gap measures, relevant to preserving locality; and b) extensive evaluation of the impact of the ordering schemes on two real-world, parallel graph applications, namely, community detection and influence maximization. Our studies show a significant divergence among the ordering schemes (up to 40x between the best and the poor) in their effectiveness to reduce the gap measures; and a wide ranging impact of the ordering schemes on various aspects including application runtime (up to 4x), memory and cache use, load balancing, and parallel work and efficiency. The comparative study also helps in revealing the nuances of a parallel environment (compared to serial) on the ordering schemes and their role in optimizing applications.
BibTeX:
@inproceedings{Barik2020,
  author = {Reet Barik and Marco Minutoli and Mahantesh Halappanavar and Nathan R. Tallent and Ananth Kalyanaraman},
  title = {Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation},
  booktitle = {Proceedings of the International Symposium on Workload Characterization},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/iiswc50251.2020.00031}
}
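One of the classical reordering schemes evaluated in studies like this ships with MATLAB; reverse Cuthill-McKee on a random sparse pattern illustrates the locality effect being measured, with matrix bandwidth standing in for the paper's gap measures.
S = spones(sprandsym(1000, 0.01));           % adjacency pattern of a random sparse graph
p = symrcm(S);                               % reverse Cuthill-McKee permutation
bwBefore = bandwidth(S, 'lower');
bwAfter  = bandwidth(S(p, p), 'lower');      % locality proxy: smaller bandwidth after reordering
fprintf('bandwidth: %d -> %d\n', bwBefore, bwAfter);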
Bauch J and Nadler B (2020), "Rank 2r iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries", February, 2020.
Abstract: We present a new, simple and computationally efficient iterative method for low rank matrix completion. Our method is inspired by the class of factorization-type iterative algorithms, but substantially differs from them in the way the problem is cast. Precisely, given a target rank r, instead of optimizing on the manifold of rank r matrices, we allow our interim estimated matrix to have a specific over-parametrized rank 2r structure. Our algorithm, denoted R2RILS, for rank 2r iterative least squares, thus has low memory requirements, and at each iteration it solves a computationally cheap sparse least-squares problem. We motivate our algorithm by its theoretical analysis for the simplified case of a rank-1 matrix. Empirically, R2RILS is able to recover, with machine precision, ill conditioned low rank matrices from very few observations -- near the information limit. Finally, R2RILS is stable to corruption of the observed entries by additive zero mean Gaussian noise.
BibTeX:
@article{Bauch2020,
  author = {Jonathan Bauch and Boaz Nadler},
  title = {Rank 2r iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries},
  year = {2020}
}
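As a point of comparison for R2RILS, the plain factorization-type baseline it is contrasted with can be written as alternating least squares over the observed entries. A rough sketch, assuming every row and column has at least r observed entries; this is the baseline, not the authors' algorithm.
function X = als_complete(M, mask, r, iters)
% Rank-r completion baseline: alternate least-squares updates of the two factors
% over the observed entries (mask is logical, true where M is observed).
[m, n] = size(M);
U = randn(m, r);
V = randn(n, r);
for it = 1:iters
    for i = 1:m                               % refit row factors
        idx = mask(i, :);
        U(i, :) = M(i, idx) / V(idx, :)';
    end
    for j = 1:n                               % refit column factors
        idx = mask(:, j);
        V(j, :) = (U(idx, :) \ M(idx, j))';
    end
end
X = U * V';
end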
Beeumen RV, Ibrahim KZ, Kahanamoku-Meyer GD, Yao NY and Yang C (2020), "Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization", December, 2020.
Abstract: In [Van Beeumen, et. al, HPC Asia 2020, https://www.doi.org/10.1145/3368474.3368497] a scalable and matrix-free eigensolver was proposed for studying the many-body localization (MBL) transition of two-level quantum spin chain models with nearest-neighbor XX+YY interactions plus Z terms. This type of problem is computationally challenging because the vector space dimension grows exponentially with the physical system size, and averaging over different configurations of the random disorder is needed to obtain relevant statistical behavior. For each eigenvalue problem, eigenvalues from different regions of the spectrum and their corresponding eigenvectors need to be computed. Traditionally, the interior eigenstates for a single eigenvalue problem are computed via the shift-and-invert Lanczos algorithm. Due to the extremely high memory footprint of the LU factorizations, this technique is not well suited for large number of spins L, e.g., one needs thousands of compute nodes on modern high performance computing infrastructures to go beyond L = 24. The matrix-free approach does not suffer from this memory bottleneck, however, its scalability is limited by a computation and communication imbalance. We present a few strategies to reduce this imbalance and to significantly enhance the scalability of the matrix-free eigensolver. To optimize the communication performance, we leverage the consistent space runtime, CSPACER, and show its efficiency in accelerating the MBL irregular communication patterns at scale compared to optimized MPI non-blocking two-sided and one-sided RMA implementation variants. The efficiency and effectiveness of the proposed algorithm is demonstrated by computing eigenstates on a massively parallel many-core high performance computer.
BibTeX:
@article{Beeumen2020,
  author = {Roel Van Beeumen and Khaled Z. Ibrahim and Gregory D. Kahanamoku-Meyer and Norman Y. Yao and Chao Yang},
  title = {Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization},
  year = {2020}
}
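Stripped of the physics and the parallelism, "matrix-free" simply means handing the eigensolver an operator rather than a stored matrix. A toy MATLAB analogue, applying a 1-D Laplacian stencil on the fly; the paper's Hamiltonians, interior spectral targets, and distributed communication are the hard part.
n = 2^12;
applyH = @(v) 2*v - [v(2:end); 0] - [0; v(1:end-1)];            % stencil applied on the fly, never assembled
lam = eigs(applyH, n, 6, 'largestreal', 'IsFunctionSymmetric', true);
% Interior eigenvalues would classically need shift-and-invert (an LU factorization),
% exactly the memory bottleneck the matrix-free approach avoids.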
Bellavia S and Gurioli G (2020), "Complexity Analysis of a Stochastic Cubic Regularisation Method under Inexact Gradient Evaluations and Dynamic Hessian Accuracy", January, 2020.
Abstract: We here adapt an extended version of the adaptive cubic regularisation method with dynamic inexact Hessian information for nonconvex optimisation in [2] to the stochastic optimisation setting. While exact function evaluations are still considered, this novel variant inherits the innovative use of adaptive accuracy requirements for Hessian approximations introduced in [2] and additionally employs inexact computations of the gradient. Without restrictions on the variance of the errors, we assume that these approximations are available within a sufficiently large, but fixed, probability and we extend, in the spirit of [13], the deterministic analysis of the framework to its stochastic counterpart, showing that the expected number of iterations to reach a first-order stationary point matches the well known worst-case optimal complexity. This is, in fact, still given by O(epsilon^(-3/2)), with respect to the first-order epsilon tolerance.
BibTeX:
@article{Bellavia2020,
  author = {Stefania Bellavia and Gianmarco Gurioli},
  title = {Complexity Analysis of a Stochastic Cubic Regularisation Method under Inexact Gradient Evaluations and Dynamic Hessian Accuracy},
  year = {2020}
}
Bemporad A and Cimini G (2020), "Reduction of the Number of Variables in Parametric Constrained Least-Squares Problems", December, 2020.
Abstract: For linearly constrained least-squares problems that depend on a vector of parameters, this paper proposes techniques for reducing the number of involved optimization variables. After first eliminating equality constraints in a numerically robust way by QR factorization, we propose a technique based on singular value decomposition (SVD) and unsupervised learning, that we call K-SVD, and neural classifiers to automatically partition the set of parameter vectors in K nonlinear regions in which the original problem is approximated by using a smaller set of variables. For the special case of parametric constrained least-squares problems that arise from model predictive control (MPC) formulations, we propose a novel and very efficient QR factorization method for equality constraint elimination. Together with SVD or K-SVD, the method provides a numerically robust alternative to standard condensing and move blocking, and to other complexity reduction methods for MPC based on basis functions. We show the good performance of the proposed techniques in numerical tests and in a linearized MPC problem of a nonlinear benchmark process.
BibTeX:
@article{Bemporad2020,
  author = {Alberto Bemporad and Gionata Cimini},
  title = {Reduction of the Number of Variables in Parametric Constrained Least-Squares Problems},
  year = {2020}
}
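The first step described above, eliminating the equality constraints with a QR factorization, can be sketched for a generic equality-constrained least-squares problem; the data here are random, C is assumed to have full row rank, and the paper's MPC-specific factorization and K-SVD partitioning are beyond this toy.
n = 50; m = 10; p = 80;
A = randn(p, n);  b = randn(p, 1);           % least-squares data
C = randn(m, n);  d = randn(m, 1);           % equality constraints C*x = d
[Q, R] = qr(C');                             % C' = Q*R with Q orthogonal
Q1 = Q(:, 1:m);   Q2 = Q(:, m+1:end);        % range of C' and its orthogonal complement
x0 = Q1 * (R(1:m, :)' \ d);                  % particular solution of C*x = d
z  = (A * Q2) \ (b - A * x0);                % reduced, unconstrained least squares in n - m variables
x  = x0 + Q2 * z;                            % feasible minimiser of ||A*x - b|| subject to C*x = d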
Bereznyi D, Qutbuddin A, Her Y and Yang K (2020), "Node-attributed Spatial Graph Partitioning", In Proceedings of the 28th International Conference on Advances in Geographic Information Systems.
Abstract: Given a spatial graph and a set of node attributes, the Node-attributed Spatial Graph Partitioning (NSGP) problem partitions a node-attributed spatial graph into k homogeneous sub-graphs that minimize both the total RMSErank1 and edge-cuts while meeting a size constraint on the sub-graphs. RMSErank1 is the Root Mean Square Error between a matrix and its rank-one decomposition. The NSGP problem is important for many societal applications such as identifying homogeneous communities in a spatial graph and detecting interrelated patterns in traffic accidents. This problem is NP-hard; it is computationally challenging because of the large size of spatial graphs and the constraint that the sub-graphs must be homogeneous, i.e. similar in terms of node attributes. This paper proposes a novel approach for finding a set of homogeneous sub-graphs that can minimize both the total RMSErank1 and edge-cuts while meeting the size constraint. Experiments and a case study using U.S. Census datasets and HP#6 watershed network datasets demonstrate that the proposed approach partitions a spatial graph into a set of homogeneous sub-graphs and reduces the computational cost.
BibTeX:
@inproceedings{Bereznyi2020,
  author = {Daniel Bereznyi and Ahmad Qutbuddin and YoungGu Her and KwangSoo Yang},
  title = {Node-attributed Spatial Graph Partitioning},
  booktitle = {Proceedings of the 28th International Conference on Advances in Geographic Information Systems},
  year = {2020},
  doi = {10.1145/3397536.3422198}
}
Bergamaschi L, Marin J and Martinez A (2020), "Compact Quasi-Newton preconditioners for SPD linear systems", January, 2020.
Abstract: In this paper preconditioners for the Conjugate Gradient method are studied to solve the Newton system with symmetric positive definite Jacobian. In particular, we define a sequence of preconditioners built by means of SR1 and BFGS low-rank updates. We develop conditions under which the SR1 update maintains the preconditioner SPD. Spectral analysis of the SR1 preconditioned Jacobians shows an improved eigenvalue distribution as the Newton iteration proceeds. A compact matrix formulation of the preconditioner update is developed which reduces the cost of its application and is more suitable for parallel implementation. Some notes on the implementation of the corresponding Inexact Newton method are given and numerical results on a number of model problems illustrate the efficiency of the proposed preconditioners.
BibTeX:
@article{Bergamaschi2020,
  author = {Luca Bergamaschi and Jose Marin and Angeles Martinez},
  title = {Compact Quasi-Newton preconditioners for SPD linear systems},
  year = {2020}
}
Bergamaschi L (2020), "A Survey of Low-Rank Updates of Preconditioners for Sequences of Symmetric Linear Systems", Algorithms., 4, 2020. Vol. 13(4), pp. 100. MDPI AG.
Abstract: The aim of this survey is to review some recent developments in devising efficient preconditioners for sequences of symmetric positive definite (SPD) linear systems A_k x_k = b_k, k = 1, 2, …, arising in many scientific applications, such as discretization of transient Partial Differential Equations (PDEs), solution of eigenvalue problems, (Inexact) Newton methods applied to nonlinear systems, and rational Krylov methods for computing a function of a matrix. In this paper, we will analyze a number of techniques for updating a given initial preconditioner by a low-rank matrix with the aim of improving the clustering of eigenvalues around 1, in order to speed up the convergence of the Preconditioned Conjugate Gradient (PCG) method. We will also review some techniques to efficiently approximate the linearly independent vectors which constitute the low-rank corrections and whose choice is crucial for the effectiveness of the approach. Numerical results on real-life applications show that the performance of a given iterative solver can be very much enhanced by the use of low-rank updates.
BibTeX:
@article{Bergamaschi2020a,
  author = {Luca Bergamaschi},
  title = {A Survey of Low-Rank Updates of Preconditioners for Sequences of Symmetric Linear Systems},
  journal = {Algorithms},
  publisher = {MDPI AG},
  year = {2020},
  volume = {13},
  number = {4},
  pages = {100},
  doi = {10.3390/a13040100}
}
Berger GO, Absil PA, Jungers RM and Nesterov Y (2020), "On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient", January, 2020.
Abstract: We show that Hölder continuity of the gradient is not only a sufficient condition, but also a necessary condition for the existence of a global upper bound on the error of the first-order Taylor approximation. We also relate this global upper bound to the Hölder constant of the gradient. This relation is expressed as an interval, depending on the Hölder constant, in which the error of the first-order Taylor approximation is guaranteed to be. We show that, for the Lipschitz continuous case, the interval cannot be reduced. An application to the norms of quadratic forms is proposed, which allows us to derive a novel characterization of Euclidean norms.
BibTeX:
@article{Berger2020,
  author = {Guillaume O. Berger and P. -A. Absil and Raphaël M. Jungers and Yurii Nesterov},
  title = {On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient},
  year = {2020}
}
Bergou E, Diouane Y and Kungurtsev V (2020), "Convergence and Complexity Analysis of a Levenberg-Marquardt Algorithm for Inverse Problems", Journal of Optimization Theory and Applications., April, 2020.
Abstract: The Levenberg-Marquardt algorithm is one of the most popular algorithms for finding the solution of nonlinear least squares problems. Across different modified variations of the basic procedure, the algorithm enjoys global convergence, a competitive worst case iteration complexity rate, and a guaranteed rate of local convergence for both zero and nonzero small residual problems, under suitable assumptions. We introduce a novel Levenberg-Marquardt method that matches, simultaneously, the state of the art in all of these convergence properties with a single seamless algorithm. Numerical experiments confirm the theoretical behavior of our proposed algorithm.
BibTeX:
@article{Bergou2020,
  author = {E. Bergou and Y. Diouane and V. Kungurtsev},
  title = {Convergence and Complexity Analysis of a Levenberg-Marquardt Algorithm for Inverse Problems},
  journal = {Journal of Optimization Theory and Applications},
  year = {2020}
}
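As a reference point for readers less familiar with the method, the following MATLAB sketch shows a textbook Levenberg-Marquardt iteration on an illustrative two-variable residual; the naive doubling/halving damping rule here is not the paper's parameter update, which is what yields its complexity guarantees.
% Minimal sketch of a textbook Levenberg-Marquardt iteration for min ||F(x)||^2
% (the damping update below is the naive accept/reject rule, not the paper's).
F = @(x) [x(1)^2 + x(2) - 1; x(1) + x(2)^2 - 1];   % example residual function
J = @(x) [2*x(1), 1; 1, 2*x(2)];                   % its Jacobian
x = [2; 2]; mu = 1;
for k = 1:50
    r = F(x); Jk = J(x);
    dx = -(Jk'*Jk + mu*eye(2)) \ (Jk'*r);          % damped Gauss-Newton step
    if norm(F(x + dx)) < norm(r)
        x = x + dx; mu = mu / 2;                   % accept step, relax damping
    else
        mu = mu * 2;                               % reject step, increase damping
    end
end
disp([x', norm(F(x))]);                            % converges to a root of F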
Bernaschi M, D'Ambra P and Pasquini D (2020), "BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs", Software Impacts., November, 2020. , pp. 100041. Elsevier BV.
Abstract: Sparse solvers are one of the building blocks of any technology for reliable and high-performance scientific and engineering computing. In this paper we present a software package which implements an efficient multigrid sparse solver running on Graphics Processing Units. The package is a branch of a wider initiative of software development for sparse Linear Algebra computations on emergent HPC architectures involving a large research group working in many application projects over the last ten years.
BibTeX:
@article{Bernaschi2020,
  author = {Massimo Bernaschi and Pasqua D'Ambra and Dario Pasquini},
  title = {BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs},
  journal = {Software Impacts},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {100041},
  doi = {10.1016/j.simpa.2020.100041}
}
Bertsimas D, Cory-Wright R and Pauphilet J (2020), "Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality", May, 2020.
Abstract: Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than p=100s covariates. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting k=10s covariates from p=300 variables, and provides small bound gaps at a larger scale. We also propose two convex relaxations and randomized rounding schemes that provide certifiably near-exact solutions within minutes for p=100s or hours for p=1,000s. Using real-world financial and medical datasets, we illustrate our approach's ability to derive interpretable principal components tractably at scale.
BibTeX:
@article{Bertsimas2020,
  author = {Dimitris Bertsimas and Ryan Cory-Wright and Jean Pauphilet},
  title = {Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality},
  year = {2020}
}
Besta M, Carigiet A, Vonarburg-Shmaria Z, Janda K, Gianinazzi L and Hoefler T (2020), "High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality", Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, November 2020., August, 2020.
Abstract: We develop the first parallel graph coloring heuristics with strong theoretical guarantees on work and depth and coloring quality. The key idea is to design a relaxation of the vertex degeneracy order, a well-known graph theory concept, and to color vertices in the order dictated by this relaxation. This introduces a tunable amount of parallelism into the degeneracy ordering that is otherwise hard to parallelize. This simple idea enables significant benefits in several key aspects of graph coloring. For example, one of our algorithms ensures polylogarithmic depth and a bound on the number of used colors that is superior to all other parallelizable schemes, while maintaining work-efficiency. In addition to provable guarantees, the developed algorithms have competitive run-times for several real-world graphs, while almost always providing superior coloring quality. Our degeneracy ordering relaxation is of separate interest for algorithms outside the context of coloring.
BibTeX:
@article{Besta2020,
  author = {Maciej Besta and Armon Carigiet and Zur Vonarburg-Shmaria and Kacper Janda and Lukas Gianinazzi and Torsten Hoefler},
  title = {High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality},
  journal = {Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, November 2020},
  year = {2020}
}
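For context, the sequential baseline the paper starts from (greedy coloring in degeneracy order) can be sketched in a few lines of MATLAB on a random test graph; the parallel relaxation of the ordering and its work/depth guarantees are the paper's actual contribution and are not reproduced here.
% Minimal sketch: sequential greedy coloring in exact degeneracy order; the
% paper parallelizes a relaxation of this ordering.
rng('default');
A = double(sprand(200, 200, 0.05) > 0);
A = triu(A, 1); A = A + A';                 % undirected adjacency, no self-loops
n = size(A, 1);
deg = full(sum(A, 2)); alive = true(n, 1); order = zeros(n, 1);
for k = 1:n                                 % peel off a minimum-degree vertex each step
    d = deg; d(~alive) = inf;
    [~, v] = min(d);
    order(k) = v; alive(v) = false;
    nbrs = find(A(v, :));
    deg(nbrs) = deg(nbrs) - 1;
end
color = zeros(n, 1);
for v = order(end:-1:1)'                    % color vertices in reverse peeling order
    used = color(find(A(v, :)));            % colors already taken by neighbours
    c = 1; while any(used == c), c = c + 1; end
    color(v) = c;
end
fprintf('colors used: %d\n', max(color));   % at most (degeneracy + 1) colors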
Bian H, Huang J, Liu L, Huang D and Wang X (2020), "ALBUS: A method for efficiently processing SpMV using SIMD and Load balancing", Future Generation Computer Systems., 11, 2020. Elsevier BV.
Abstract: SpMV (Sparse matrix-vector multiplication) is widely used in many fields. Improving the performance of SpMV has been the pursuit of many researchers. Parallel SpMV using multi-core processors has been a standard parallel method used by researchers. In reality, the number of non-zero elements in many sparse matrices is not evenly distributed, so parallelism without preprocessing will cause a large amount of performance loss due to uneven load. In this paper, we propose ALBUS (Absolute Load Balancing Using SIMD (Single Instruction Multiple Data)), a method for efficiently processing SpMV using load balancing and SIMD vectorization. On the one hand, ALBUS can achieve multi-core balanced load processing; on the other hand, it fully exploits SIMD vectorization on the CPU. We selected 20 sets of regular matrices and 20 sets of irregular matrices to form the benchmark suite. We performed SpMV performance comparison tests on ALBUS, CSR5 (Compressed Sparse Row 5), Merge (Merge-based SpMV), and MKL (Math Kernel Library) under the same conditions. On the E5-2670 v3 CPU platform, for 20 sets of regular matrices, ALBUS can achieve an average speedup of 1.59x, 1.32x, 1.48x (up to 2.53x, 2.22x, 2.31x) compared to CSR5, Merge, MKL, respectively. For 20 sets of irregular matrices, ALBUS can achieve an average speedup of 1.38x, 1.42x, 2.44x (up to 2.33x, 2.24x, 5.37x) compared to CSR5, Merge, MKL, respectively.
BibTeX:
@article{Bian2020,
  author = {Haodong Bian and Jianqiang Huang and Lingbin Liu and Dongqiang Huang and Xiaoying Wang},
  title = {ALBUS: A method for efficiently processing SpMV using SIMD and Load balancing},
  journal = {Future Generation Computer Systems},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.future.2020.10.036}
}
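The load-balancing idea ALBUS builds on can be illustrated with a short MATLAB sketch on an illustrative random matrix: split the rows so that each thread owns roughly the same number of nonzeros. The SIMD kernel and the handling of rows split across threads, which are the paper's contributions, are not reproduced here.
% Minimal sketch: nonzero-balanced row partitioning for parallel SpMV.
rng('default');
A = sprand(1000, 1000, 0.01);
nthreads = 4;
rownnz = full(sum(A ~= 0, 2));
cumnnz = [0; cumsum(rownnz)];               % prefix sum, as in a CSR row pointer
splits = [1, zeros(1, nthreads - 1), size(A, 1) + 1];
for t = 1:nthreads - 1
    target = t * cumnnz(end) / nthreads;
    splits(t + 1) = find(cumnnz >= target, 1);   % first row past the t-th target
end
x = rand(size(A, 2), 1); y = zeros(size(A, 1), 1);
for t = 1:nthreads                          % each "thread" handles one row block
    rows = splits(t):splits(t + 1) - 1;
    y(rows) = A(rows, :) * x;
end
disp(norm(y - A * x));                      % sanity check against the direct product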
Blanco MP, McMillan S and Low TM (2020), "Towards an Objective Metric for the Performance of Exact Triangle Count", September, 2020.
Abstract: The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.
BibTeX:
@article{Blanco2020,
  author = {Mark P. Blanco and Scott McMillan and Tze Meng Low},
  title = {Towards an Objective Metric for the Performance of Exact Triangle Count},
  year = {2020}
}
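As a reminder of the operation being measured, exact triangle counting has a compact linear-algebraic form; the MATLAB sketch below is only this baseline computation on a random test graph, not the paper's proposed metric.
% Minimal sketch: exact triangle counting on an undirected graph via
% trace(A^3)/6 and the equivalent masked form sum((A*A).*A)/6.
rng('default');
A = double(sprand(500, 500, 0.02) > 0);
A = triu(A, 1); A = A + A';                 % symmetric adjacency, empty diagonal
ntri  = full(trace(A * A * A)) / 6;         % every triangle is counted six times
ntri2 = full(sum(sum((A * A) .* A))) / 6;   % masked form keeps only entries of A
disp([ntri, ntri2]);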
Bogle I, Boman EG, Devine K and Slota GM (2020), "Distributed Memory Graph Coloring Algorithms for Multiple GPUs"
Abstract: Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but hybrid MPI+GPU algorithms have been unexplored until this work, to the best of our knowledge. We present several MPI+GPU coloring approaches that use implementations of the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to solve for distance-2 coloring, giving the first known distributed and multi-GPU algorithm for this problem. In addition, we propose novel methods to reduce communication in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.
BibTeX:
@article{Bogle2020,
  author = {Ian Bogle and Erik G. Boman and Karen Devine and George M. Slota},
  title = {Distributed Memory Graph Coloring Algorithms for Multiple GPUs},
  year = {2020},
  url = {http://www.cs.rpi.edu/~slotag/pub/Coloring-IA320.pdf}
}
Boley D (2020), "On Fast Computation of Directed Graph Laplacian Pseudo-Inverse", February, 2020.
Abstract: The Laplacian matrix and its pseudo-inverse for a strongly connected directed graph is fundamental in computing many properties of a directed graph. Examples include random-walk centrality and betweenness measures, average hitting and commute times, and other connectivity measures. These measures arise in the analysis of many social and computer networks. In this short paper, we show how a linear system involving the Laplacian may be solved in time linear in the number of edges, times a factor depending on the separability of the graph. This leads directly to the column-by-column computation of the entire Laplacian pseudo-inverse in time quadratic in the number of nodes, i.e., constant time per matrix entry. The approach is based on "off-the-shelf" iterative methods for which global linear convergence is guaranteed, without recourse to any matrix elimination algorithm.
BibTeX:
@article{Boley2020,
  author = {Daniel Boley},
  title = {On Fast Computation of Directed Graph Laplacian Pseudo-Inverse},
  year = {2020}
}
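To illustrate the column-by-column idea in the easier undirected (symmetric) setting, one column of the Laplacian pseudo-inverse follows from a single linear solve, as in the MATLAB sketch below on an illustrative connected graph; the paper itself addresses the harder directed case with guaranteed-convergent iterative solvers.
% Minimal sketch (undirected case only, assuming a connected graph): column j of
% pinv(L) from one solve, since (L + ones(n)/n) \ (e_j - 1/n) = pinv(L)*e_j.
rng('default');
n = 300;
A = spdiags(ones(n, 2), [-1, 1], n, n);      % path backbone keeps the graph connected
A = double((A + sprand(n, n, 0.02)) > 0);
A = triu(A, 1); A = A + A';                  % symmetric adjacency, no self-loops
Lap = diag(full(sum(A, 2))) - full(A);       % combinatorial graph Laplacian
j = 7; ej = zeros(n, 1); ej(j) = 1;
colj = (Lap + ones(n) / n) \ (ej - 1/n);     % one column of the pseudo-inverse
ref = pinv(Lap);                             % dense reference, viable only for small n
disp(norm(colj - ref(:, j)));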
Bolte J and Pauwels E (2020), "Curiosities and counterexamples in smooth convex optimization", January, 2020.
Abstract: Counterexamples to some old-standing optimization problems in the smooth convex coercive setting are provided. We show that block-coordinate, steepest descent with exact search or Bregman descent methods do not generally converge. Other failures of various desirable features are established: directional convergence of Cauchy's gradient curves, convergence of Newton's flow, finite length of Tikhonov path, convergence of central paths, or smooth Kurdyka-Lojasiewicz inequality. All examples are planar. These examples are based on general smooth convex interpolation results. Given a decreasing sequence of positively curved C^k convex compact sets in the plane, we provide a level set interpolation of a C^k smooth convex function where k ≥ 2 is arbitrary. If the intersection is reduced to one point our interpolant has positive definite Hessian, otherwise it is positive definite out of the solution set. Furthermore, given a sequence of decreasing polygons we provide an interpolant agreeing with the vertices and whose gradients coincide with prescribed normals.
BibTeX:
@article{Bolte2020,
  author = {Jerome Bolte and Edouard Pauwels},
  title = {Curiosities and counterexamples in smooth convex optimization},
  year = {2020}
}
Bonifati A, Dumbrava S and Kondylakis H (2020), "Graph Summarization", April, 2020.
Abstract: The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. The focus of our chapter is to pinpoint the main graph summarization methods, but especially to focus on the most recent approaches and novel research trends on this topic, not yet covered by previous surveys.
BibTeX:
@article{Bonifati2020,
  author = {Angela Bonifati and Stefania Dumbrava and Haridimos Kondylakis},
  title = {Graph Summarization},
  year = {2020}
}
Booth JD and Bolet G (2020), "An On-Node Scalable Sparse Incomplete LU Factorization for a Many-Core Iterative Solver with Javelin", Parallel Computing., 3, 2020. , pp. 102622. Elsevier BV.
Abstract: We present a scalable incomplete LU factorization to be used as a preconditioner for solving sparse linear systems with iterative methods in the package called Javelin. Javelin allows for improved parallel factorization on shared-memory many-core systems by packaging the coefficient matrix into a format that allows for high performance sparse matrix-vector multiplication and sparse triangular solves with minimal overheads. The framework achieves these goals by using a collection of traditional permutations, point-to-point thread synchronizations, tasking, and segmented prefix scans in a conventional compressed sparse row (CSR) format. Moreover, this framework stresses the importance of co-designing dependent tasks, such as sparse factorization and triangular solves, on highly-threaded architectures. We compare our method to the past distributed methods for incomplete factorization (Aztec) and current multithreaded packages (WSMP) in order to demonstrate the importance of having highly threaded factorizations on many-core systems. Using these changes, traditional fill-in and drop tolerance methods can be used, while still being able to have observed speedups of up to ∼ 42 × on 68 Intel Knights Landing cores and ∼ 12 × on 14 Intel Haswell cores. Moreover, this work provides insight into how the new data-structure impacts iteration counts, and provides insight into future improvements, such as point to GPUs.
BibTeX:
@article{Booth2020,
  author = {Joshua Dennis Booth and Gregory Bolet},
  title = {An On-Node Scalable Sparse Incomplete LU Factorization for a Many-Core Iterative Solver with Javelin},
  journal = {Parallel Computing},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {102622},
  doi = {10.1016/j.parco.2020.102622}
}
Booth JD (2020), "Auto Adaptive Irregular OpenMP Loops", July, 2020.
Abstract: OpenMP is a standard for shared-memory parallelization, owing to the ease of programming parallel-for loops in a fork-join manner. Many shared-memory applications are implemented using this model despite it not being ideal for applications with high load imbalance, such as those that make irregular memory accesses. One parameter, i.e., chunk size, is made available to users in order to mitigate performance loss. However, this parameter depends on the architecture, system load, application, and input, making it difficult to tune. We present an OpenMP scheduler that adaptively tunes the chunk size for unbalanced applications that make irregular memory accesses. In particular, this method (iCh) uses work-stealing for imbalance and adapts chunk size using a force-feedback model that approximates the variance of task length in a chunk. This scheduler has low overhead and allows for active load balancing while the applications are running. We demonstrate this using both sparse matrix-vector multiplication (spmv) and Betweenness Centrality (bc) and show that iCh can achieve average speedups close to (i.e., within 1.061x for spmv and 1.092x for bc) those of OpenMP loops scheduled with dynamic or work-stealing methods that had chunk size tuned offline.
BibTeX:
@article{Booth2020a,
  author = {Joshua Dennis Booth},
  title = {Auto Adaptive Irregular OpenMP Loops},
  year = {2020}
}
Bosilca G, Harrison R, Herault T, Javanmard M, Nookala P and Valeev E (2020), "The Template Task Graph (TTG) -- an emerging practical dataflow programming paradigm for scientific simulation at extreme scale", In Proceedings of the 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware.
Abstract: We describe TESSE, an emerging general-purpose, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. TESSE builds upon and extends the PaRSEC DAG/dataflow runtime with a new Domain Specific Language (DSL) and new integration capabilities. Motivating this work is our belief that such a dataflow model, perhaps with applications composed in domain specific languages, can overcome many of the challenges faced by a wide variety of irregular applications that are poorly served by current programming and execution models. Two such applications from many-body physics and applied mathematics are briefly explored. This paper focuses upon the Template Task Graph (TTG), which is TESSE's main C++ API that provides a powerful work/data-flow programming model. Algorithms on spatial trees, block-sparse tensors, and wave fronts are used to illustrate the API and associated concepts, as well as to compare with related approaches.
BibTeX:
@inproceedings{Bosilca2020,
  author = {G. Bosilca and R.J. Harrison and T. Herault and M.M. Javanmard and P. Nookala and E.F. Valeev},
  title = {The Template Task Graph (TTG) -- an emerging practical dataflow programming paradigm for scientific simulation at extreme scale},
  booktitle = {Proceedings of the 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware},
  year = {2020}
}
Botoeva E, Kouvaros P, Kronqvist J, Lomuscio A and Misener R (2020), "Efficient Verification of ReLU-based Neural Networks via Dependency Analysis", In Proceedings of the 34th AAAI Conference on Artificial Intelligence.
Abstract: We introduce an efficient method for the verification of ReLU-based feed-forward neural networks. We derive an automated procedure that exploits dependency relations between the ReLU nodes, thereby pruning the search tree that needs to be considered by MILP-based formulations of the verification problem. We augment the resulting algorithm with methods for input domain splitting and symbolic interval propagation. We present Venus, the resulting verification toolkit, and evaluate it on the ACAS collision avoidance networks and models trained on the MNIST and CIFAR-10 datasets. The experimental results obtained indicate considerable gains over the present state-of-the-art tools.
BibTeX:
@inproceedings{Botoeva2020,
  author = {Elena Botoeva and Panagiotis Kouvaros and Jan Kronqvist and Alessio Lomuscio and Ruth Misener},
  title = {Efficient Verification of ReLU-based Neural Networks via Dependency Analysis},
  booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
  year = {2020}
}
Boukaram W, Lucchesi M, Turkiyyah G, Le Maître O, Knio O and Keyes D (2020), "Hierarchical matrix approximations for space-fractional diffusion equations", Computer Methods in Applied Mechanics and Engineering., 9, 2020. Vol. 369, pp. 113191. Elsevier BV.
Abstract: Space fractional diffusion models generally lead to dense discrete matrix operators, which lead to substantial computational challenges when the system size becomes large. For a state of size N, full representation of a fractional diffusion matrix would require O(N^2) memory storage requirement, with a similar estimate for matrix–vector products. In this work, we present ℋ^2 matrix representation and algorithms that are amenable to efficient implementation on GPUs, and that can reduce the cost of storing these operators to O(N) asymptotically. Matrix–vector multiplications can be performed in asymptotically linear time as well. Performance of the algorithms is assessed in light of 2D simulations of space fractional diffusion equation with constant diffusivity. Attention is focused on smooth particle approximation of the governing equations, which lead to discrete operators involving explicit radial kernels. The algorithms are first tested using the fundamental solution of the unforced space fractional diffusion equation in an unbounded domain, and then for the steady, forced, fractional diffusion equation in a bounded domain. Both matrix-inverse and pseudo-transient solution approaches are considered in the latter case. Our experiments show that the construction of the fractional diffusion matrix, the matrix–vector multiplication, and the generation of an approximate inverse pre-conditioner all perform very well on a single GPU on 2D problems with N in the range 10^5 -- 10^6. In addition, the tests also showed that, for the entire range of parameters and fractional orders considered, results obtained using the ℋ^2 approximations were in close agreement with results obtained using dense operators, and exhibited the same spatial order of convergence. Overall, the present experiences showed that the ℋ^2 matrix framework promises to provide practical means to handle large-scale space fractional diffusion models in several space dimensions, at a computational cost that is asymptotically similar to the cost of handling classical diffusion equations.
BibTeX:
@article{Boukaram2020,
  author = {Wajih Boukaram and Marco Lucchesi and George Turkiyyah and Olivier Le Maître and Omar Knio and David Keyes},
  title = {Hierarchical matrix approximations for space-fractional diffusion equations},
  journal = {Computer Methods in Applied Mechanics and Engineering},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {369},
  pages = {113191},
  doi = {10.1016/j.cma.2020.113191}
}
Bramas B and Ketterlin A (2020), "Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering", PeerJ Computer Science., 1, 2020. Vol. 6, pp. e247. PeerJ.
Abstract: The task-based approach is a parallelization paradigm in which an algorithm is transformed into a direct acyclic graph of tasks: the vertices are computational elements extracted from the original algorithm and the edges are dependencies between those. During the execution, the management of the dependencies adds an overhead that can become significant when the computational cost of the tasks is low. A possibility to reduce the makespan is to aggregate the tasks to make them heavier, while having fewer of them, with the objective of mitigating the importance of the overhead. In this paper, we study an existing clustering/partitioning strategy to speed up the parallel execution of a task-based application. We provide two additional heuristics to this algorithm and perform an in-depth study on a large graph set. In addition, we propose a new model to estimate the execution duration and use it to choose the proper granularity. We show that this strategy allows speeding up a real numerical application by a factor of 7 on a multi-core system.
BibTeX:
@article{Bramas2020,
  author = {Bérenger Bramas and Alain Ketterlin},
  title = {Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering},
  journal = {PeerJ Computer Science},
  publisher = {PeerJ},
  year = {2020},
  volume = {6},
  pages = {e247},
  doi = {10.7717/peerj-cs.247}
}
Brayford D, Bernau C, Hesse W and Guillen C (2020), "Analyzing Performance Properties Collected by the PerSyst Scalable HPC Monitoring Tool", September, 2020.
Abstract: The ability to understand how a scientific application is executed on a large HPC system is of great importance in allocating resources within the HPC data center. In this paper, we describe how we used system performance data to identify: execution patterns, possible code optimizations and improvements to the system monitoring. We also identify candidates for employing machine learning techniques to predict the performance of similar scientific codes.
BibTeX:
@article{Brayford2020,
  author = {David Brayford and Christoph Bernau and Wolfram Hesse and Carla Guillen},
  title = {Analyzing Performance Properties Collected by the PerSyst Scalable HPC Monitoring Tool},
  year = {2020}
}
Brock B, Buluç A, Mattson TG, McMillan S and Moreira JE (2020), "A Roadmap for the GraphBLAS C++ API". Thesis at: U.C. Berkeley.
Abstract: The GraphBLAS is an API for graph algorithms expressed in terms of linear algebra. The current GraphBLAS specification is for the C Programming Language. Implementations of the GraphBLAS exposed a number of limitations due to C that restrict both the expressiveness and the performance of the GraphBLAS. The C++ language's first-class support for generics, including template metaprogramming, addresses these limitations, yielding a simpler GraphBLAS API that should deliver better performance, especially for methods based on user-defined types and operators. When combined with the pervasiveness of C++ across many domains as well as within large-scale distributed codes, we see a compelling argument to define a GraphBLAS C++ API. This paper presents a roadmap for the development of a GraphBLAS C++ API with a focus on the open issues we must resolve before completing the specification. Our goal is to foster discussion within the GraphBLAS user community and receive feedback on the directions we are taking with the GraphBLAS C++ API.
BibTeX:
@techreport{Brock2020,
  author = {Benjamin Brock and Aydın Buluç and Timothy G. Mattson and Scott McMillan and Jose E. Moreira},
  title = {A Roadmap for the GraphBLAS C++ API},
  school = {U.C. Berkeley},
  year = {2020}
}
Brock B, Buluc A, Mattson TG, McMillan S, Moreira JE, Pearce R, Selvitopi O and Steil T (2020), "Considerations for a Distributed GraphBLAS API", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops., 5, 2020. IEEE.
Abstract: The GraphBLAS emerged from an international effort to standardize linear-algebraic building blocks for computing on graphs and graph-structured data. The GraphBLAS is expressed as a C API and has paved the way for multiple implementations. The GraphBLAS C API, however, does not define how distributed-memory parallelism should be handled. This paper reviews various approaches for a GraphBLAS API for distributed computing. This work is guided by our experience with existing distributed memory libraries. Our goal for this paper is to highlight the pros and cons of different approaches rather than to advocate for one particular choice.
BibTeX:
@inproceedings{Brock2020a,
  author = {Benjamin Brock and Aydin Buluc and Timothy G. Mattson and Scott McMillan and Jose E. Moreira and Roger Pearce and Oguz Selvitopi and Trevor Steil},
  title = {Considerations for a Distributed GraphBLAS API},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ipdpsw50202.2020.00048}
}
Brown C, Abdelfattah A, Tomov S and Dongarra J (2020), "Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs". Thesis at: University of Tennessee.
Abstract: Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparisons with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.
BibTeX:
@techreport{Brown2020,
  author = {Cade Brown and Ahmad Abdelfattah and Stanimire Tomov and Jack Dongarra},
  title = {Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs},
  school = {University of Tennessee},
  year = {2020},
  url = {https://www.icl.utk.edu/files/publications/2020/icl-utk-1405-2020.pdf}
}
Brunie H, Iancu C, Ibrahim K, Brisk P and Cook B (2020), "Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 694-707. IEEE Computer Society.
Abstract: We present a methodology for precision tuning of full applications. These techniques must select a search space composed of either variables or instructions and provide a scalable search strategy. In full application settings one cannot assume compiler support for practical reasons. Thus, an additional important challenge is enabling code refactoring. We argue for an instruction-based search space and we show: 1) how to exploit dynamic program information based on call stacks; and 2) how to exploit the iterative nature of scientific codes, combined with temporal locality. We applied the methodology to tune the implementation of scientific codes written in a combination of Python, CUDA, C++ and Fortran, tuning calls to math exp library functions. The iterative search refinement always reduces the search complexity and the number of steps to solution. Dynamic program information increases search efficacy. Using this approach, we obtain application runtime performance improvements up to 27%.
BibTeX:
@inproceedings{Brunie2020,
  author = {H. Brunie and C. Iancu and K. Ibrahim and P. Brisk and B. Cook},
  title = {Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {694--707},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00054},
  doi = {10.1109/SC41405.2020.00054}
}
Brust JJ, Leyffer S and Petra CG (2020), "Compact Representations of Structured BFGS Matrices"
Abstract: For general large-scale optimization problems compact representations exist in which recursive quasi-Newton update formulas are represented as compact matrix factorizations. For problems in which the objective function contains additional structure, so-called structured quasi-Newton methods exploit available second-derivative information and approximate unavailable second derivatives. This article develops the compact representations of two structured Broyden-Fletcher-Goldfarb-Shanno update formulas. The compact representations enable efficient limited memory and initialization strategies. Two limited memory line search algorithms are described and tested on a collection of problems.
BibTeX:
@article{Brust2020,
  author = {J. J. Brust and S. Leyffer and C. G. Petra},
  title = {Compact Representations of Structured BFGS Matrices},
  year = {2020}
}
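The classical (unstructured) compact representation of Byrd, Nocedal and Schnabel, which the paper generalizes, can be verified numerically in a few lines of MATLAB; the structured variants developed in the paper are not reproduced here.
% Minimal sketch: the classical compact BFGS representation
%   B = B0 - [B0*S Y] * inv([S'*B0*S  L; L'  -D]) * [B0*S Y]'
% checked against the recursively updated BFGS matrix.
rng('default');
n = 8; m = 3;
S = randn(n, m); Y = randn(n, m);
for i = 1:m                                   % enforce the curvature condition s_i'*y_i > 0
    if S(:, i)' * Y(:, i) <= 0, Y(:, i) = -Y(:, i); end
end
B0 = eye(n); B = B0;
for i = 1:m                                   % standard recursive BFGS updates
    s = S(:, i); y = Y(:, i);
    B = B - (B*s)*(B*s)' / (s'*B*s) + (y*y') / (y'*s);
end
SY = S' * Y;
L = tril(SY, -1); D = diag(diag(SY));
U = [B0*S, Y];
W = [S'*B0*S, L; L', -D];
Bc = B0 - U * (W \ U');                       % compact form of the same matrix
disp(norm(B - Bc, 'fro'));                    % agreement up to rounding error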
Bueler E (2020), "PETSc for Partial Differential Equations: Numerical Solutions in C and Python", 1, 2020. Society for Industrial and Applied Mathematics.
Abstract: The Portable, Extensible Toolkit for Scientific Computation (PETSc) is an open-source library of advanced data structures and methods for solving linear and nonlinear equations and for managing discretizations. This book uses these modern numerical tools to demonstrate how to solve nonlinear partial differential equations (PDEs) in parallel. It starts from key mathematical concepts, such as Krylov space methods, preconditioning, multigrid, and Newton's method. In PETSc these components are composed at run time into fast solvers. Discretizations are introduced from the beginning, with an emphasis on finite difference and finite element methodologies. The example C programs of the first 12 chapters, listed on the inside front cover, solve (mostly) elliptic and parabolic PDE problems. Discretization leads to large, sparse, and generally nonlinear systems of algebraic equations. For such problems, mathematical solver concepts are explained and illustrated through the examples, with sufficient context to speed further development.
PETSc for Partial Differential Equations:
- addresses both discretization and fast solvers for PDEs;
- emphasizes practice more than theory;
- contains well-structured examples, with advice on run-time solver choices;
- demonstrates how to achieve high performance and parallel scalability; and
- builds on the reader's understanding of fast solver concepts when applying the Firedrake Python finite element solver library in the last two chapters.
This textbook, the first to cover PETSc programming for nonlinear PDEs, provides an on-ramp for graduate students and researchers to a major area of high-performance computing for science and engineering. It is suitable as a supplement for courses in scientific computing or numerical methods for differential equations.
BibTeX:
@book{Bueler2020,
  author = {Ed Bueler},
  title = {PETSc for Partial Differential Equations: Numerical Solutions in C and Python},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  doi = {10.1137/1.9781611976311}
}
Bullins B and Lai KA (2020), "Higher-order methods for convex-concave min-max optimization and monotone variational inequalities", July, 2020.
Abstract: We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the p^th-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of O(1/T^((p+1)/2)) when given access to an oracle for finding a fixed point of a p^th-order equation. We give analogous rates for the weak monotone variational inequality problem. For p>2, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski [2004] and the second-order method of Monteiro and Svaiter [2012]. We further instantiate our entire algorithm in the unconstrained p=2 case.
BibTeX:
@article{Bullins2020,
  author = {Brian Bullins and Kevin A. Lai},
  title = {Higher-order methods for convex-concave min-max optimization and monotone variational inequalities},
  year = {2020}
}
Burke JV, Curtis FE, Wang H and Wang J (2020), "Inexact Sequential Quadratic Optimization with Penalty Parameter Updates within the QP Solver", SIAM Journal on Optimization., 1, 2020. Vol. 30(3), pp. 1822-1849. Society for Industrial & Applied Mathematics (SIAM).
Abstract: This paper focuses on the design of sequential quadratic optimization (commonly known as SQP) methods for solving large-scale nonlinear optimization problems. The most computationally demanding aspect of such an approach is the computation of the search direction during each iteration, for which we consider the use of matrix-free methods. In particular, we develop a method that requires an inexact solve of a single QP subproblem to establish the convergence of the overall SQP method. It is known that SQP methods can be plagued by poor behavior of the global convergence mechanism. To confront this issue, we propose the use of an exact penalty function with a dynamic penalty parameter updating strategy to be employed within the subproblem solver in such a way that the resulting search direction predicts progress toward both feasibility and optimality. We present our parameter updating strategy and prove that, under reasonable assumptions, the strategy does not modify the penalty parameter unnecessarily. We close the paper with a discussion of the results of numerical experiments that illustrate the benefits of our proposed techniques.
BibTeX:
@article{Burke2020,
  author = {James V. Burke and Frank E. Curtis and Hao Wang and Jiashan Wang},
  title = {Inexact Sequential Quadratic Optimization with Penalty Parameter Updates within the QP Solver},
  journal = {SIAM Journal on Optimization},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {30},
  number = {3},
  pages = {1822--1849},
  doi = {10.1137/18m1176488}
}
Buttari A, Huber M, Leleux P, Mary T, Rüde U and Wohlmuth B (2020), "Block Low Rank Single Precision Coarse Grid Solvers for Extreme Scale Multigrid Methods". Thesis at: Institut de recherche en informatique de Toulouse (IRIT).
Abstract: Extreme scale simulation requires fast and scalable algorithms, such as multigrid methods. To achieve asymptotically optimal complexity it is essential to employ a hierarchy of grids. The cost to solve the coarsest grid system can often be neglected in sequential computing, but cannot be ignored in massively parallel executions. In this case, the coarsest grid can be large and its efficient solution becomes a challenging task. We propose solving the coarse grid system using modern, approximate sparse direct methods and investigate the expected gains compared with traditional iterative methods. Since the coarse grid system only requires an approximate solution, we show that we can leverage block low-rank techniques, combined with the use of single precision arithmetic, to significantly reduce the computational requirements of the direct solver. In the case of extreme scale computing, the coarse grid system is too large for a sequential solution, but too small to permit massively parallel efficiency. We show that the agglomeration of the coarse grid system to a subset of processors is necessary for the sparse direct solver to achieve performance. We demonstrate the efficiency of the proposed method on a Stokes-type saddle point system. We employ a monolithic Uzawa multigrid method. In particular, we show that the use of an approximate sparse direct solver for the coarse grid system can outperform that of a preconditioned minimal residual iterative method. This is demonstrated for the multigrid solution of systems of order up to 10^11 degrees of freedom on a petascale supercomputer using 43 200 processes.
BibTeX:
@techreport{Buttari2020,
  author = {Alfredo Buttari and Markus Huber and Philippe Leleux and Theo Mary and Ulrich Rüde and Barbara Wohlmuth},
  title = {Block Low Rank Single Precision Coarse Grid Solvers for Extreme Scale Multigrid Methods},
  school = {Institut de recherche en informatique de Toulouse (IRIT)},
  year = {2020},
  url = {https://hal.archives-ouvertes.fr/hal-02528532}
}
Cai Y and Li P (2020), "Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations", March, 2020.
Abstract: We consider the problem of robust matrix completion, which aims to recover a low rank matrix L_* and a sparse matrix S_* from incomplete observations of their sum M = L_* + S_* ∊ ℝ^(m×n). Algorithmically, the robust matrix completion problem is transformed into a problem of solving a system of nonlinear equations, and the alternating direction method is then used to solve the nonlinear equations. In addition, the algorithm is highly parallelizable and suitable for large scale problems. Theoretically, we characterize the sufficient conditions for when L_* can be approximated by a low rank approximation of the observed M. Under proper assumptions, it is shown that the algorithm converges to the true solution linearly. Numerical simulations show that the simple method works as expected and is comparable with state-of-the-art methods.
BibTeX:
@article{Cai2020,
  author = {Yunfeng Cai and Ping Li},
  title = {Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations},
  year = {2020}
}
Calandra H, Gratton S, Riccietti E and Vasseur X (2020), "On a multilevel Levenberg–Marquardt method for the training of artificial neural networks and its application to the solution of partial differential equations", Optimization Methods and Software., 6, 2020. , pp. 1-26. Informa UK Limited.
Abstract: In this paper, we propose a new multilevel Levenberg–Marquardt optimizer for the training of artificial neural networks with quadratic loss function. This setting allows us to get further insight into the potential of multilevel optimization methods. Indeed, when the least squares problem arises from the training of artificial neural networks, the variables subject to optimization are not related by any geometrical constraints and the standard interpolation and restriction operators cannot be employed any longer. A heuristic, inspired by algebraic multigrid methods, is then proposed to construct the multilevel transfer operators. We test the new optimizer on an important application: the approximate solution of partial differential equations by means of artificial neural networks. The learning problem is formulated as a least squares problem, choosing the nonlinear residual of the equation as a loss function, whereas the multilevel method is employed as a training method. Numerical experiments show encouraging results related to the efficiency of the new multilevel optimization method compared to the corresponding one-level procedure in this context.
BibTeX:
@article{Calandra2020,
  author = {H. Calandra and S. Gratton and E. Riccietti and X. Vasseur},
  title = {On a multilevel Levenberg–Marquardt method for the training of artificial neural networks and its application to the solution of partial differential equations},
  journal = {Optimization Methods and Software},
  publisher = {Informa UK Limited},
  year = {2020},
  pages = {1--26},
  doi = {10.1080/10556788.2020.1775828}
}
Campos JS, Misener R and Parpas P (2020), "Partial Lasserre relaxation for sparse Max-Cut"
Abstract: A common approach to solve or find bounds of polynomial optimization problems like Max-Cut is to use the first level of the Lasserre hierarchy. Higher levels of the Lasserre hierarchy provide tighter bounds, but solving these relaxations is usually computationally intractable. We propose to strengthen the first level relaxation for sparse Max-Cut problems using constraints from the second order Lasserre hierarchy. We explore a variety of approaches for adding a subset of the positive semidefinite constraints of the second order sparse relaxation obtained by using the maximum cliques of the graph's chordal extension. We apply this idea to sparse graphs of different sizes and densities, and provide evidence of its strengths and limitations when compared to the state-of-the-art Max-Cut solver BiqCrunch and the alternative sparse relaxation CS-TSSOS.
BibTeX:
@article{Campos2020,
  author = {Juan S. Campos and Ruth Misener and Panos Parpas},
  title = {Partial Lasserre relaxation for sparse Max-Cut},
  year = {2020}
}
Cao X and Liu KJR (2020), "Distributed Newton's Method for Network Cost Minimization", IEEE Transactions on Automatic Control. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: In this work, we examine a novel generic network cost minimization problem, in which every node has a local decision vector to optimize. Each node incurs a cost associated with its decision vector while each link incurs a cost related to the decision vectors of its two end nodes. All nodes collaborate to minimize the overall network cost. The formulated network cost minimization problem has broad applications in distributed signal processing and control, in which the notion of link costs often arises. To solve this problem in a decentralized manner, we develop a distributed variant of the Newton's method, which possesses faster convergence than alternative first order optimization methods such as gradient descent and alternating direction method of multipliers. The proposed method is based on an appropriate splitting of the Hessian matrix and an approximation of its inverse, which is used to determine the Newton step. Global linear convergence of the proposed algorithm is established under several standard technical assumptions on the local cost functions. Furthermore, analogous to classical centralized Newton's method, a quadratic convergence phase of the algorithm over a certain time interval is identified. Finally, numerical simulations are conducted to validate the effectiveness of the proposed algorithm and its superiority over other first order methods, especially when the cost functions are ill-conditioned. Complexity issues of the proposed distributed Newton's method and alternative first order methods are also discussed.
BibTeX:
@article{Cao2020,
  author = {Xuanyu Cao and K. J. Ray Liu},
  title = {Distributed Newton's Method for Network Cost Minimization},
  journal = {IEEE Transactions on Automatic Control},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--1},
  doi = {10.1109/tac.2020.2989266}
}
Cao Q, Pei Y, Akbudak K, Bosilca G, Ltaief H, Keyes DE and Dongarra J (2020), "Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems", In Proceedings of the 35th IEEE International Parallel & Distributed Processing Symposium.
Abstract: The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide a dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, and supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorization by up to 7-fold on a large-scale distributed-memory system, while minimizing the memory footprint up to a 44-fold factor. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.
BibTeX:
@inproceedings{Cao2020a,
  author = {Cao, Qinglei and Pei, Yu and Akbudak, Kadir and Bosilca, George and Ltaief, Hatem and Keyes, David E. and Dongarra, Jack},
  title = {Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems},
  booktitle = {Proceedings of the 35th IEEE International Parallel & Distributed Processing Symposium},
  year = {2020}
}
Carmon Y and Duchi JC (2020), "First-Order Methods for Nonconvex Quadratic Minimization", March, 2020.
Abstract: We consider minimization of indefinite quadratics with either trust-region (norm) constraints or cubic regularization. Despite the nonconvexity of these problems we prove that, under mild assumptions, gradient descent converges to their global solutions, and give a non-asymptotic rate of convergence for the cubic variant. We also consider Krylov subspace solutions and establish sharp convergence guarantees to the solutions of both trust-region and cubic-regularized problems. Our rates mirror the behavior of these methods on convex quadratics and eigenvector problems, highlighting their scalability. When we use Krylov subspace solutions to approximate the cubic-regularized Newton step, our results recover the strongest known convergence guarantees to approximate second-order stationary points of general smooth nonconvex functions.
BibTeX:
@article{Carmon2020,
  author = {Yair Carmon and John C. Duchi},
  title = {First-Order Methods for Nonconvex Quadratic Minimization},
  year = {2020}
}
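The setting can be made concrete with a small MATLAB experiment: plain gradient descent on a cubic-regularized indefinite quadratic, the nonconvex model for which the paper proves global convergence. The tiny fixed step size and the random test problem below are illustrative choices, not the step-size rules analyzed in the paper.
% Minimal sketch: gradient descent on a cubic-regularized indefinite quadratic.
rng('default');
n = 50;
Q = randn(n); A = (Q + Q') / 2;               % symmetric, generally indefinite
b = randn(n, 1); rho = 1;
f    = @(x) 0.5 * x' * A * x + b' * x + (rho/3) * norm(x)^3;
grad = @(x) A * x + b + rho * norm(x) * x;
x = zeros(n, 1); step = 1e-3;                 % crude, conservative constant step
for k = 1:20000
    x = x - step * grad(x);
end
disp([f(x), norm(grad(x))]);                  % small gradient norm at the final iterate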
Carratalá-Sáez R, Faverge M, Pichon G, Sylvand G and Quintana-Ortí ES (2020), "Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers". Thesis at: INRIA.
Abstract: In this paper, we describe and evaluate an extension of the Chameleon library to operate with hierarchical matrices (H-Matrices) and hierarchical arithmetic (H-Arithmetic), producing efficient solvers for linear systems arising in Boundary Element Methods (BEM). Our approach builds upon an open-source H-Matrices library from Airbus, named Hmat-oss, that collects sequential numerical kernels for both hierarchical and low-rank structures; the tiled algorithms and task-parallel decompositions available in Chameleon for the solution of linear systems; and the StarPU runtime system to orchestrate an efficient task-parallel (multi-threaded) execution on a multicore architecture. Using an application producing matrices with features close to real industrial applications, we present shared-memory results that demonstrate a fair level of performance, close to (and sometimes better than) the one offered by a pure H-Matrix approach, as proposed by Airbus Hmat proprietary (and non open-source) library. Hence, this combination Chameleon + Hmat-oss proposes the most efficient fully open-source software stack to solve dense compressible linear systems on shared memory architectures (distributed memory is under development).
BibTeX:
@techreport{Carratala2020,
  author = {Rocío Carratalá-Sáez and Mathieu Faverge and Grégoire Pichon and Guillaume Sylvand and Enrique S. Quintana-Ortí},
  title = {Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers},
  school = {INRIA},
  year = {2020},
  url = {https://hal.inria.fr/hal-02489269}
}
Carson E and Strakoš Z (2020), "On the cost of iterative computations", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190050. The Royal Society.
Abstract: With exascale-level computation on the horizon, the art of predicting the cost of computations has acquired a renewed focus. This task is especially challenging in the case of iterative methods, for which convergence behaviour often cannot be determined with certainty a priori (unless we are satisfied with potentially outrageous overestimates) and which typically suffer from performance bottlenecks at scale due to synchronization cost. Moreover, the amplification of rounding errors can substantially affect the practical performance, in particular for methods with short recurrences. In this article, we focus on what we consider to be key points which are crucial to understanding the cost of iteratively solving linear algebraic systems. This naturally leads us to questions on the place of numerical analysis in relation to mathematics, computer science and sciences, in general.
BibTeX:
@article{Carson2020,
  author = {Carson, Erin and Strakoš, Zdeněk},
  title = {On the cost of iterative computations},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190050},
  doi = {10.1098/rsta.2019.0050}
}
Carson E, Higham NJ and Pranesh S (2020), "Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems"
Abstract: The standard iterative refinement procedure for improving an approximate solution to the least squares problem min_x ||b - Ax||_2, where A ∊ ℝ^(m×n) with m ≥ n has full rank, is based on solving the (m + n) × (m + n) augmented system with the aid of a QR factorization. In order to exploit multiprecision arithmetic, iterative refinement can be formulated to use three precisions, but the resulting algorithm converges only for a limited range of problems. We build an iterative refinement algorithm called GMRES-LSIR, analogous to the GMRES-IR algorithm developed for linear systems [SIAM J. Sci. Comput., 40 (2019), pp. A817-A847], that solves the augmented system using GMRES preconditioned by a matrix based on the computed QR factors. We explore two left preconditioners; the first has full off-diagonal blocks and the second is block diagonal and can be applied in either left-sided or split form. We prove that for a wide range of problems the first preconditioner yields backward and forward errors for the augmented system of order the working precision under suitable assumptions on the precisions and the problem conditioning. Our proof does not extend to the block diagonal preconditioner, but our numerical experiments show that with this preconditioner the algorithm performs about as well in practice.
BibTeX:
@article{Carson2020a,
  author = {Carson, Erin and Higham, Nicholas J. and Pranesh, Srikara},
  title = {Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2745/1/paper.pdf}
}
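The augmented-system formulation behind GMRES-LSIR is easy to write down. The MATLAB sketch below runs plain fixed-precision iterative refinement on the augmented system, solving each correction directly via the computed QR factors rather than with the preconditioned GMRES of the paper; the matrix sizes and the number of refinement steps are arbitrary choices for illustration.
rng('default');
m = 200; n = 50;
A = randn(m, n); b = randn(m, 1);
[Q, R] = qr(A, 0);                          % economy-size QR of A
x = R \ (Q' * b);                           % initial least squares solution
r = b - A * x;                              % initial residual
K = [eye(m), A; A', zeros(n)];              % (m+n) x (m+n) augmented system
for k = 1:3
    f  = [b; zeros(n, 1)] - K * [r; x];     % residual of the augmented system
    f1 = f(1:m); f2 = f(m+1:end);
    dx = R \ (R' \ (A' * f1 - f2));         % correction via the QR-based normal equations
    dr = f1 - A * dx;
    r  = r + dr;
    x  = x + dx;
end
norm(A' * (b - A * x))                      % optimality residual, ideally near machine precision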
Carson E and Gergelits T (2020), "Quarter I Report: Initial Exploration of the Use of Mixed Precision in Iterative Solvers". Thesis at: LLNL-Charles University.
Abstract: The first quarter of the project was primarily spent identifying potential projects at the intersection of finite precision analysis, mixed precision computation, and Krylov subspace methods. We summarize our findings in the remainder of the document. Other activities include attending biweekly xSDK meetings as well as contributing material to the technical report and journal versions of the multiprecision landscape paper. The subsequent quarter will be spent selecting a subset of the described projects to focus on, performing initial numerical experiments to evaluate the potential for the use of mixed precision, and developing initial theoretical analysis.
BibTeX:
@techreport{Carson2020b,
  author = {Erin Carson and Tomáš Gergelits},
  title = {Quarter I Report: Initial Exploration of the Use of Mixed Precision in Iterative Solvers},
  school = {LLNL-Charles University},
  year = {2020}
}
Carson E, Lund K, Rozložník M and Thomas S (2020), "An overview of block Gram-Schmidt methods and their stability properties", October, 2020.
Abstract: Block Gram-Schmidt algorithms comprise essential kernels in many scientific computing applications, but for many commonly used variants, a rigorous treatment of their stability properties remains open. This survey provides a comprehensive categorization of block Gram-Schmidt algorithms, especially those used in Krylov subspace methods to build orthonormal bases one block vector at a time. All known stability results are assembled, and new results are summarized or conjectured for important communication-reducing variants. A diverse array of numerical illustrations is presented, along with the MATLAB code for reproducing the results, in a publicly available repository at https://github.com/katlund/BlockStab. A number of open problems are discussed, and an appendix containing all algorithms type-set in a uniform fashion is provided.
BibTeX:
@article{Carson2020c,
  author = {Erin Carson and Kathryn Lund and Miroslav Rozložník and Stephen Thomas},
  title = {An overview of block Gram-Schmidt methods and their stability properties},
  year = {2020}
}
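Block classical Gram-Schmidt, one of the variants categorized in the survey, takes only a few lines in MATLAB; this is a bare sketch on a random matrix with arbitrary block sizes (the authors' reference implementations are in the BlockStab repository linked above).
rng('default');
m = 500; s = 5; p = 8;                       % m rows, p blocks of s columns
X = randn(m, s * p);
Q = zeros(m, s * p);
R = zeros(s * p, s * p);
for k = 1:p
    cols = (k - 1) * s + (1:s);
    prev = 1:(k - 1) * s;
    W = X(:, cols);
    R(prev, cols) = Q(:, prev)' * W;         % project against all previous blocks
    W = W - Q(:, prev) * R(prev, cols);
    [Q(:, cols), R(cols, cols)] = qr(W, 0);  % intra-block orthogonalization
end
norm(Q' * Q - eye(s * p))                    % loss of orthogonality of the computed basis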
Cartis C, Gould N and Toint PL (2020), "Strong Evaluation Complexity Bounds for Arbitrary-Order Optimization of Nonconvex Nonsmooth Composite Functions", January, 2020.
Abstract: We introduce the concept of strong high-order approximate minimizers for nonconvex optimization problems. These apply in both standard smooth and composite non-smooth settings, and additionally allow convex or inexpensive constraints. An adaptive regularization algorithm is then proposed to find such approximate minimizers. Under suitable Lipschitz continuity assumptions, whenever the feasible set is convex, it is shown that using a model of degree p, this algorithm will find a strong approximate q-th-order minimizer in at most 𝒪(max_{1≤j≤q} ε_j^-(p+1)/(p-j+1)) evaluations of the problem's functions and their derivatives, where ε_j is the j-th order accuracy tolerance; this bound applies when either q=1 or the problem is not composite with q ≤ 2. For general non-composite problems, even when the feasible set is nonconvex, the bound becomes 𝒪(max_{1≤j≤q} ε_j^-q(p+1)/p) evaluations. If the problem is composite, and either q > 1 or the feasible set is not convex, the bound is then 𝒪(max_{1≤j≤q} ε_j^-(q+1)) evaluations. These results not only provide, to our knowledge, the first known bound for (unconstrained or inexpensively-constrained) composite problems for optimality orders exceeding one, but also give the first sharp bounds for high-order strong approximate q-th order minimizers of standard (unconstrained and inexpensively constrained) smooth problems, thereby complementing known results for weak minimizers.
BibTeX:
@article{Cartis2020,
  author = {Coralia Cartis and Nick Gould and Philippe L. Toint},
  title = {Strong Evaluation Complexity Bounds for Arbitrary-Order Optimization of Nonconvex Nonsmooth Composite Functions},
  year = {2020}
}
Cartis C, Gould NIM and Toint PL (2020), "Strong Evaluation Complexity of An Inexact Trust-Region Algorithm for Arbitrary-Order Unconstrained Nonconvex Optimization", November, 2020.
Abstract: A trust-region algorithm using inexact function and derivative values is introduced for solving unconstrained smooth optimization problems. This algorithm uses high-order Taylor models and allows the search of strong approximate minimizers of arbitrary order. The evaluation complexity of finding a q-th approximate minimizer using this algorithm is then shown, under standard conditions, to be 𝒪(max_{j∊{1,…,q}} ε_j^-(q+1)), where the ε_j are the order-dependent requested accuracy thresholds. Remarkably, this order is identical to that of classical trust-region methods using exact information.
BibTeX:
@article{Cartis2020a,
  author = {C. Cartis and N. I. M. Gould and Ph. L. Toint},
  title = {Strong Evaluation Complexity of An Inexact Trust-Region Algorithm for Arbitrary-Order Unconstrained Nonconvex Optimization},
  year = {2020}
}
Champion C, Mélanie B, Rémy B, Jean-Michel L and Laurent R (2020), "Robust spectral clustering using LASSO regularization", April, 2020.
Abstract: Cluster structure detection is a fundamental task for the analysis of graphs, in order to understand and to visualize their functional characteristics. Among the different cluster structure detection methods, spectral clustering is currently one of the most widely used due to its speed and simplicity. Yet, there are few theoretical guarantees for recovering the underlying partitions of the graph for general models. This paper therefore presents a variant of spectral clustering, called ℓ_1-spectral clustering, performed on a new random model closely related to the stochastic block model. Its goal is to promote a sparse eigenbasis solution of an ℓ_1 minimization problem revealing the natural structure of the graph. The effectiveness and the robustness to small noise perturbations of our technique are confirmed through a collection of simulated and real data examples.
BibTeX:
@article{Champion2020,
  author = {Camille Champion and Blazère Mélanie and Burcelin Rémy and Loubes Jean-Michel and Risser Laurent},
  title = {Robust spectral clustering using LASSO regularization},
  year = {2020}
}
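For reference, the plain spectral clustering pipeline that the ℓ_1 variant above modifies fits in a few lines of MATLAB; the two-block stochastic block model parameters below are invented for illustration, and kmeans requires the Statistics and Machine Learning Toolbox.
rng('default');
n = 100;                                      % nodes per block
P = [0.25 0.05; 0.05 0.25];                   % within- and between-block edge probabilities
blocks = [ones(n, 1); 2 * ones(n, 1)];
A = rand(2 * n) < P(blocks, blocks);
A = triu(A, 1); A = A + A';                   % symmetric adjacency, no self-loops
L = diag(sum(A, 2)) - A;                      % combinatorial graph Laplacian
[V, D] = eig(L);
[~, order] = sort(diag(D));                   % eigenvectors of the two smallest eigenvalues
labels = kmeans(V(:, order(1:2)), 2);         % cluster the spectral embedding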
Chang TH (2020), "Mathematical Software for Multiobjective Optimization Problems". Thesis at: Virginia Polytechnic Institute and State University.
Abstract: In this thesis, two distinct problems in data-driven computational science are considered. The main problem of interest is the multiobjective optimization problem, where the tradeoff surface (called the Pareto front) between multiple conflicting objectives must be approximated in order to identify designs that balance real-world tradeoffs. In order to solve multiobjective optimization problems that are derived from computationally expensive blackbox functions, such as engineering design optimization problems, several methodologies are combined, including surrogate modeling, trust region methods, and adaptive weighting. The result is a numerical software package that finds approximately Pareto optimal solutions that are evenly distributed across the Pareto front, using minimal cost function evaluations. The second problem of interest is the closely related problem of multivariate interpolation, where an unknown response surface representing an underlying phenomenon is approximated by finding a function that exactly matches available data. To solve the interpolation problem, a novel algorithm is proposed for computing only a sparse subset of the elements in the Delaunay triangulation, as needed to compute the Delaunay interpolant. For high-dimensional data, this reduces the time and space complexity of Delaunay interpolation from exponential time to polynomial time in practice. For each of the above problems, both serial and parallel implementations are described. Additionally, both solutions are demonstrated on real-world problems in computer system performance modeling.
BibTeX:
@phdthesis{Chang2020,
  author = {Tyler H. Chang},
  title = {Mathematical Software for Multiobjective Optimization Problems},
  school = {Virginia Polytechnic Institute and State University},
  year = {2020}
}
Chatzidimitriou A and Gizopoulos D (2020), "rACE: Reverse-Order Processor Reliability Analysis", In Proceedings of the 2020 Design, Automation Test in Europe Conference Exhibition., 3, 2020. , pp. 1115-1120.
Abstract: Modern microprocessors suffer from increased error rates that come along with fabrication technology scaling. Processor designs continuously become more prone to hardware faults that lead to execution errors and system failures, which raise the requirement of protection mechanisms. However, error mitigation strategies have to be applied diligently, as they impose significant power, area, and performance overheads. Early and accurate reliability estimation of a microprocessor design is essential in order to determine the most vulnerable hardware structures and the most efficient protection schemes. One of the most commonly used techniques for reliability estimation is Architecturally Correct Execution (ACE) analysis. ACE analysis can be applied at different abstraction models, including microarchitecture and RTL and often requires a single or few simulations to report the Architectural Vulnerability Factor (AVF) of the processor structures. However, ACE analysis overestimates the vulnerability of structures because of its pessimistic, worst-case nature. Moreover, it only delivers coarse-grain vulnerability reports and no details about the expected result of hardware faults (silent data corruptions, crashes). In this paper, we present reverse ACE (rACE), a methodology that (a) improves the accuracy of ACE analysis and (b) delivers fine-grain error outcome reports. Using a reverse-order tracing flow, rACE analysis associates portions of the simulated execution of a program with the actual output and the control flow, delivering finer accuracy and results classification. Our findings show that rACE reports an average 1.45× overestimation, compared to Statistical Fault Injection, for different sizes of the register file of an out-of-order CPU core (executing both ARM and x86 binaries), when a baseline ACE analysis reports 2.3× overestimation and even refined versions of ACE analysis report an average of 1.8× overestimation.
BibTeX:
@inproceedings{Chatzidimitriou2020,
  author = {A. Chatzidimitriou and D. Gizopoulos},
  title = {rACE: Reverse-Order Processor Reliability Analysis},
  booktitle = {Proceedings of the 2020 Design, Automation Test in Europe Conference Exhibition},
  year = {2020},
  pages = {1115--1120},
  doi = {10.23919/DATE48585.2020.9116355}
}
Chen Y, Wang S, Zheng F and Cen Y (2020), "Graph-regularized least squares regression for multi-view subspace clustering", Knowledge-Based Systems., 1, 2020. , pp. 105482. Elsevier BV.
Abstract: Many works have proven that the consistency and differences in multi-view subspace clustering make the clustering results better than the single-view clustering. Therefore, this paper studies the multi-view clustering problem, which aims to divide data points into several groups using multiple features. However, existing multi-view clustering methods fail to capture the grouping effect and local geometrical structure of the multiple features. In order to solve these problems, this paper proposes a novel multi-view subspace clustering model called graph-regularized least squares regression (GLSR), which uses not only the least squares regression instead of the nuclear norm to generate the grouping effect, but also the manifold constraint to preserve the local geometrical structure of multiple features. Specifically, the proposed GLSR method adopts the least squares regression to learn the globally consensus information shared by multiple views and the column-sparsity norm to measure the residual information. Under the alternating direction method of multipliers framework, an effective method is developed by iteratively updating all variables. Numerical studies on eight real databases demonstrate the effectiveness and superior performance of the proposed GLSR over eleven state-of-the-art methods.
BibTeX:
@article{Chen2020,
  author = {Yongyong Chen and Shuqin Wang and Fangying Zheng and Yigang Cen},
  title = {Graph-regularized least squares regression for multi-view subspace clustering},
  journal = {Knowledge-Based Systems},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {105482},
  doi = {10.1016/j.knosys.2020.105482}
}
Chen T, Lasserre J-B, Magron V and Pauwels E (2020), "Polynomial Optimization for Bounding Lipschitz Constants of Deep Networks", February, 2020.
Abstract: The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Network. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constant of a multiple layer deep neural network. The novelty is to combine a polynomial lifting for ReLU functions derivatives with a weak generalization of Putinar's positivity certificate. This idea could also apply to other, nearly sparse, polynomial optimization problems in machine learning. We empirically demonstrate that our method not only runs faster than state-of-the-art linear programming based method, but also provides sharper bounds.
BibTeX:
@article{Chen2020a,
  author = {Tong Chen and Jean-Bernard Lasserre and Victor Magron and Edouard Pauwels},
  title = {Polynomial Optimization for Bounding Lipschitz Constants of Deep Networks},
  year = {2020}
}
Chen Y, Xiao G, Wu F, Tang Z and Li K (2020), "tpSpMV: A Two-Phase Large-scale Sparse Matrix-Vector Multiplication Kernel for Manycore Architectures", Information Sciences., March, 2020. Elsevier BV.
Abstract: Sparse matrix-vector multiplication (SpMV) is one of the important subroutines in numerical linear algebra, widely used in many large-scale applications. Accelerating SpMV on multicore and manycore architectures based on the Compressed Sparse Row (CSR) format via row-wise parallelization is one of the most popular directions. However, there are three main challenges in optimizing parallel CSR-based SpMV: (a) the limited local memory of each computing unit can be overwhelmed by assignments to long rows of large-scale sparse matrices; (b) irregular accesses to the input vector result in expensive memory access latency; (c) the sparse data structure leads to low bandwidth usage. This paper proposes a two-phase large-scale SpMV, called tpSpMV, based on the memory structure and computing architecture of multicore and manycore architectures to alleviate these three main difficulties. First, we propose the two-phase parallel execution technique for tpSpMV that performs parallel CSR-based SpMV in two separate phases to overcome the computational scale limitation. Second, we respectively propose adaptive partitioning methods and parallelization designs using the local memory caching technique for the two phases to exploit the architectural advantages of high-performance computing platforms and alleviate the problem of high memory access latency. Third, we design several optimizations, such as data reduction, aligned memory accessing, and pipelining, to improve bandwidth usage and optimize tpSpMV's performance. Experimental results on the SW26010 CPUs of the Sunway TaihuLight supercomputer show that tpSpMV achieves speedups of up to 28.61× and yields a performance improvement of 13.16% over the state-of-the-art work on average.
BibTeX:
@article{Chen2020b,
  author = {Yuedan Chen and Guoqing Xiao and Fan Wu and Zhuo Tang and Keqin Li},
  title = {tpSpMV: A Two-Phase Large-scale Sparse Matrix-Vector Multiplication Kernel for Manycore Architectures},
  journal = {Information Sciences},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.ins.2020.03.020}
}
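The row-wise CSR kernel that tpSpMV reorganizes into two phases is worth spelling out; the MATLAB sketch below builds the CSR arrays of a random sparse matrix (sizes are arbitrary) and applies the baseline kernel that the paper's partitioning and caching schemes accelerate.
rng('default');
N = 1000;
A = sprand(N, N, 0.01);
x = randn(N, 1);
[cols, rows, vals] = find(A');                        % transposing makes find return entries grouped by row of A
row_ptr = cumsum([1; accumarray(rows, 1, [N 1])]);    % CSR row pointers
y = zeros(N, 1);
for i = 1:N
    for k = row_ptr(i):row_ptr(i + 1) - 1             % nonzeros of row i
        y(i) = y(i) + vals(k) * x(cols(k));
    end
end
norm(y - A * x)                                       % check against the built-in SpMV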
Chen Y, Xiao G, Ozsu MT, Liu C, Zomaya A and Li T (2020), "aeSpTV: An Adaptive and Efficient Framework for Sparse Tensor-Vector Product Kernel on a High-Performance Computing Platform", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Multi-dimensional, large-scale, and sparse data, which can be neatly represented by sparse tensors, are increasingly used in various applications such as data analysis and machine learning. A high-performance sparse tensor-vector product (SpTV), one of the most fundamental operations for processing sparse tensors, is necessary for improving the efficiency of related applications. In this paper, we propose aeSpTV, an adaptive and efficient SpTV framework on the Sunway TaihuLight supercomputer, to solve several challenges of optimizing SpTV on high-performance computing platforms. First, to map SpTV to the Sunway architecture and tame expensive memory access latency and parallel writing conflicts due to the intrinsic irregularity of SpTV, we introduce an adaptive SpTV parallelization. Second, to co-execute with the parallelization design while still ensuring high efficiency, we design a sparse tensor data structure named CSSoCR. Third, based on the adaptive SpTV parallelization with the novel tensor data structure, we present an auto-tuner that chooses the most befitting tensor partitioning method for aeSpTV using the variance analysis theory of mathematical statistics to achieve load balance. Fourth, to further leverage the computing power of Sunway, we propose customized optimizations for aeSpTV. Experimental results show that aeSpTV yields good scalability on both thread-level and process-level parallelism of Sunway. It achieves a maximum of 195.69 GFLOPS on 128 processes. Additionally, the partitioning auto-tuner and the optimization techniques are shown to have a substantial effect.
BibTeX:
@article{Chen2020c,
  author = {Yuedan Chen and Guoqing Xiao and M. Tamer Ozsu and Chubo Liu and Albert Zomaya and Tao Li},
  title = {aeSpTV: An Adaptive and Efficient Framework for Sparse Tensor-Vector Product Kernel on a High-Performance Computing Platform},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--1},
  doi = {10.1109/tpds.2020.2990429}
}
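The core operation, a mode-wise sparse tensor-vector product, can be stated compactly in MATLAB on a COO (coordinate) representation; the tensor sizes, density, and contraction mode below are arbitrary, and duplicate coordinates are simply summed.
rng('default');
I = 50; J = 60; K = 70; nz = 5000;
subs = [randi(I, nz, 1), randi(J, nz, 1), randi(K, nz, 1)];  % COO coordinates
vals = randn(nz, 1);                                         % COO values
v = randn(K, 1);
% Y(i,j) = sum_k X(i,j,k) * v(k): contract along mode 3, accumulating entry by entry
Y = accumarray(subs(:, 1:2), vals .* v(subs(:, 3)), [I J]);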
Chen L, Hu X and Wu H (2020), "Randomized Fast Subspace Descent Methods", June, 2020.
Abstract: Randomized Fast Subspace Descent (RFASD) methods are developed and analyzed for smooth, unconstrained convex optimization problems. The efficiency of the method relies on a space decomposition which is stable in the A-norm and for which the condition number κ_A measured in the A-norm is small. At each iteration, the subspace is chosen randomly, either uniformly or with probability proportional to the local Lipschitz constants. Then, in each chosen subspace, a preconditioned gradient descent method is applied. RFASD converges sublinearly for convex functions and linearly for strongly convex functions. Compared with randomized block coordinate descent methods, the convergence of RFASD is faster provided κ_A is small and the subspace decomposition is A-stable. This improvement is supported by considering a multilevel space decomposition for Nesterov's `worst' problem.
BibTeX:
@article{Chen2020d,
  author = {Long Chen and Xiaozhe Hu and Huiwen Wu},
  title = {Randomized Fast Subspace Descent Methods},
  year = {2020}
}
Chen S, Ma S, Xue L and Zou H (2020), "An Alternating Manifold Proximal Gradient Method for Sparse Principal Component Analysis and Sparse Canonical Correlation Analysis", INFORMS Journal on Optimization., 7, 2020. , pp. ijoo.2019.0032. Institute for Operations Research and the Management Sciences (INFORMS).
Abstract: Sparse principal component analysis and sparse canonical correlation analysis are two essential techniques from high-dimensional statistics and machine learning for analyzing large-scale data. Both problems can be formulated as an optimization problem with nonsmooth objective and nonconvex constraints. Because nonsmoothness and nonconvexity bring numerical difficulties, most algorithms suggested in the literature either solve some relaxations of them or are heuristic and lack convergence guarantees. In this paper, we propose a new alternating manifold proximal gradient method to solve these two high-dimensional problems and provide a unified convergence analysis. Numerical experimental results are reported to demonstrate the advantages of our algorithm.
BibTeX:
@article{Chen2020e,
  author = {Shixiang Chen and Shiqian Ma and Lingzhou Xue and Hui Zou},
  title = {An Alternating Manifold Proximal Gradient Method for Sparse Principal Component Analysis and Sparse Canonical Correlation Analysis},
  journal = {INFORMS Journal on Optimization},
  publisher = {Institute for Operations Research and the Management Sciences (INFORMS)},
  year = {2020},
  pages = {ijoo.2019.0032},
  doi = {10.1287/ijoo.2019.0032}
}
Chen C, Liang T and Biros G (2020), "RCHOL: Randomized Cholesky Factorization for Solving SDD Linear Systems", November, 2020.
Abstract: We introduce a randomized algorithm, rchol, to construct an approximate Cholesky factorization for a given sparse Laplacian matrix (a.k.a., graph Laplacian). The (exact) Cholesky factorization for the matrix introduces a clique in the associated graph after eliminating every row/column. By randomization, rchol samples a subset of the edges in the clique. We prove rchol is breakdown free and apply it to solving linear systems with symmetric diagonally-dominant matrices. In addition, we parallelize rchol based on the nested dissection ordering for shared-memory machines. Numerical experiments demonstrated the robustness and the scalability of rchol. For example, our parallel code scaled up to 64 threads on a single node for solving the 3D Poisson equation, discretized with the 7-point stencil on a 1024× 1024 × 1024 grid, or one billion unknowns.
BibTeX:
@article{Chen2020f,
  author = {Chao Chen and Tianyu Liang and George Biros},
  title = {RCHOL: Randomized Cholesky Factorization for Solving SDD Linear Systems},
  year = {2020}
}
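The experimental setting (an SDD system from the 7-point Poisson stencil) is easy to reproduce at a small scale in MATLAB; rchol itself is not reproduced here, so the sketch below uses MATLAB's deterministic incomplete Cholesky as a stand-in preconditioner for PCG, on a much smaller grid than the billion-unknown problem in the abstract.
rng('default');
n = 20;                                          % 20^3 grid; the paper uses 1024^3
e = ones(n, 1);
T = spdiags([-e 2*e -e], -1:1, n, n);            % 1D second-difference matrix
Id = speye(n);
A = kron(kron(Id, Id), T) + kron(kron(Id, T), Id) + kron(kron(T, Id), Id);   % 7-point 3D Laplacian
b = randn(n^3, 1);
L = ichol(A, struct('type', 'ict', 'droptol', 1e-3));   % stand-in for the randomized factor
[x, flag] = pcg(A, b, 1e-8, 200, L, L');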
Cheng Y, Panigrahi D and Sun K (2020), "Sparsification of Balanced Directed Graphs", June, 2020.
Abstract: Sparsification, where the cut values of an input graph are approximately preserved by a sparse graph (called a cut sparsifier) or a succinct data structure (called a cut sketch), has been an influential tool in graph algorithms. But, this tool is restricted to undirected graphs, because some directed graphs are known to not admit sparsification. Such examples, however, are structurally very dissimilar to undirected graphs in that they exhibit highly unbalanced cuts. This motivates us to ask: can we sparsify a balanced digraph? To make this question concrete, we define balance β of a digraph as the maximum ratio of the cut value in the two directions (Ene et al., STOC 2016). We show the following results: For-All Sparsification: If all cut values need to be simultaneously preserved (cf. Benczúr and Karger, STOC 1996), then we show that the size of the sparsifier (or even cut sketch) must scale linearly with β. The upper bound is a simple extension of sparsification of undirected graphs (formally stated recently in Ikeda and Tanigawa (WAOA 2018)), so our main contribution here is to show a matching lower bound. For-Each Sparsification: If each cut value needs to be individually preserved (Andoni et al., ITCS 2016), then the situation is more interesting. Here, we give a cut sketch whose size scales with √β, thereby beating the linear lower bound above. We also show that this result is tight by exhibiting a matching lower bound of √β on "for-each" cut sketches. Our upper bounds work for general weighted graphs, while the lower bounds even hold for unweighted graphs with no parallel edges.
BibTeX:
@article{Cheng2020,
  author = {Yu Cheng and Debmalya Panigrahi and Kevin Sun},
  title = {Sparsification of Balanced Directed Graphs},
  year = {2020}
}
Chennupati G, Santhi N, Romero P and Eidenbenz S (2020), "Machine Learning Enabled Scalable Performance Prediction of Scientific Codes", October, 2020.
Abstract: We present the Analytical Memory Model with Pipelines (AMMP) of the Performance Prediction Toolkit (PPT). PPT-AMMP takes high-level source code and hardware architecture parameters as input and predicts the runtime of that code on the target hardware platform, which is defined in the input parameters. PPT-AMMP transforms the code to an (architecture-independent) intermediate representation, then (i) analyzes the basic block structure of the code, (ii) processes architecture-independent virtual memory access patterns that it uses to build memory reuse distance distribution models for each basic block, and (iii) runs detailed basic-block level simulations to determine hardware pipeline usage. PPT-AMMP uses machine learning and regression techniques to build the prediction models based on small instances of the input code, then integrates into a higher-order discrete-event simulation model of PPT running on the Simian PDES engine. We validate PPT-AMMP on four standard computational physics benchmarks and finally present a use case of hardware parameter sensitivity analysis to identify bottleneck hardware resources on different code inputs. We further extend PPT-AMMP to predict the performance of a scientific application, the radiation transport code SNAP. We analyze the application of multi-variate regression models that accurately predict the reuse profiles and the basic block counts. The predicted runtimes of SNAP are accurate when compared to the actual measured times.
BibTeX:
@article{Chennupati2020,
  author = {Gopinath Chennupati and Nandakishore Santhi and Phill Romero and Stephan Eidenbenz},
  title = {Machine Learning Enabled Scalable Performance Prediction of Scientific Codes},
  year = {2020}
}
Cheramangalath U, Nasre R and Srikant YN (2020), "Graph Analytics Frameworks", In Distributed Graph Analytics. , pp. 99-122. Springer International Publishing.
Abstract: Frameworks take away the drudgery of routine tasks in programming graph analytic applications. This chapter describes in some detail, the different models of execution that are used in graph analytics, such as BSP, Map-Reduce, asynchronous execution, GAS, Inspector-Executor, and Advance-Filter-Compute. It also provides a glimpse of different existing frameworks on multi-core CPUs, GPUs, and distributed systems.
BibTeX:
@incollection{Cheramangalath2020,
  author = {Unnikrishnan Cheramangalath and Rupesh Nasre and Y. N. Srikant},
  title = {Graph Analytics Frameworks},
  booktitle = {Distributed Graph Analytics},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {99--122},
  doi = {10.1007/978-3-030-41886-1_4}
}
Cherubin S, Cattaneo D, Chiari M and Agosta G (2020), "Dynamic Precision Autotuning with TAFFO", ACM Transactions on Architecture and Code Optimization., 5, 2020. Vol. 17(2), pp. 1-26. Association for Computing Machinery (ACM).
Abstract: Many classes of applications, both in the embedded and high performance domains, can trade off the accuracy of the computed results for computation performance. One way to achieve such a trade-off is precision tuning—that is, to modify the data types used for the computation by reducing the bit width, or by changing the representation from floating point to fixed point. We present a methodology for high-accuracy dynamic precision tuning based on the identification of input classes (i.e., classes of input datasets that benefit from similar optimizations). When a new input region is detected, the application kernels are re-compiled on the fly with the appropriate selection of parameters. In this way, we obtain a continuous optimization approach that enables the exploitation of the reduced precision computation while progressively exploring the solution space, thus reducing the time required by compilation overheads. We provide tools to support the automation of the runtime part of the solution, leaving to the user only the task of identifying the input classes. Our approach provides a significant performance boost (up to 320%) on the typical approximate computing benchmarks, without meaningfully affecting the accuracy of the result, since the error remains always below 3%.
BibTeX:
@article{Cherubin2020,
  author = {Stefano Cherubin and Daniele Cattaneo and Michele Chiari and Giovanni Agosta},
  title = {Dynamic Precision Autotuning with TAFFO},
  journal = {ACM Transactions on Architecture and Code Optimization},
  publisher = {Association for Computing Machinery (ACM)},
  year = {2020},
  volume = {17},
  number = {2},
  pages = {1--26},
  url = {https://dl.acm.org/doi/pdf/10.1145/3388785},
  doi = {10.1145/3388785}
}
Chevalier C, Ledoux F and Morais S (2020), "A Multilevel Mesh Partitioning Algorithm Driven by Memory Constraints", In Proceedings of the SIAM Workshop on Combinatorial Scientific Computing., 1, 2020. , pp. 85-95. Society for Industrial and Applied Mathematics.
Abstract: Running numerical simulations on HPC architectures requires distributing data to be processed over the various available processing units. This task is usually done by partitioning tools, whose primary goal is to balance the workload while minimizing inter-process communication. However, they do not take the memory load and memory capacity of the processing units into account. As this can lead to memory overflow, we propose a new approach to address mesh partitioning by including ghost cells in the memory usage and by considering memory capacity as a strong constraint to abide. We model the problem using a bipartite graph and present a new greedy algorithm that aims at producing a partition according to the memory capacity. This algorithm focuses on memory consumption, and we use it in a multi-level approach to improving the quality of the returned solutions during the refinement phase. The experimental results obtained from our benchmarks show that our approach can yield solutions respecting memory constraints for instances where traditional partitioning tools fail.
BibTeX:
@incollection{Chevalier2020,
  author = {Cédric Chevalier and Franck Ledoux and Sébastien Morais},
  title = {A Multilevel Mesh Partitioning Algorithm Driven by Memory Constraints},
  booktitle = {Proceedings of the SIAM Workshop on Combinatorial Scientific Computing},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  pages = {85--95},
  doi = {10.1137/1.9781611976229.9}
}
Chieu NH, Hien LV and Trang NTQ (2020), "Tilt Stability for Quadratic Programs with One or Two Quadratic Inequality Constraints", Acta Mathematica Vietnamica., 6, 2020. Vol. 45(2), pp. 477-499. Springer Science and Business Media LLC.
Abstract: This paper examines tilt stability for quadratic programs with one or two quadratic inequality constraints. Exploiting specific features of these problems and using some known results on tilt stability in nonlinear programming, we establish quite simple characterizations of tilt-stable local minimizers for quadratic programs with one quadratic inequality constraint under metric subregularity constraint qualification. By the same way, we also derive various tilt stability conditions for quadratic programs with two quadratic inequality constraints and satisfying certain suitable assumptions. Especially, the obtained results show that some tilt stability conditions only known to be sufficient in nonlinear programming become the necessary ones when the considered problems are quadratic programs with one or two quadratic inequality constraints.
BibTeX:
@article{Chieu2020,
  author = {Nguyen Huy Chieu and Le Van Hien and Nguyen Thi Quynh Trang},
  title = {Tilt Stability for Quadratic Programs with One or Two Quadratic Inequality Constraints},
  journal = {Acta Mathematica Vietnamica},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {45},
  number = {2},
  pages = {477--499},
  doi = {10.1007/s40306-020-00372-4}
}
Chilukuri A, Milthorpe J and Johnston B (2020), "Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features", March, 2020.
Abstract: High-performance computing developers are faced with the challenge of optimizing the performance of OpenCL workloads on diverse architectures. The Architecture-Independent Workload Characterization (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program performance on an arbitrary given hardware architecture. However, AIWC metrics are not always easily interpreted and do not reflect some important memory access patterns affecting efficiency across architectures. We propose a new metric of parallel spatial locality -- the closeness of memory accesses simultaneously issued by OpenCL work-items (threads). We implement the parallel spatial locality metric in the AIWC framework, and analyse gathered results on matrix multiply and the Extended OpenDwarfs OpenCL benchmarks. The differences in the observed parallel spatial locality metric across implementations of matrix multiply reflect the optimizations performed. The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures. The improvements suggested to AIWC will help HPC developers better understand memory access patterns of complex codes and guide optimization of codes for arbitrary hardware targets.
BibTeX:
@article{Chilukuri2020,
  author = {Aditya Chilukuri and Josh Milthorpe and Beau Johnston},
  title = {Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features},
  year = {2020}
}
Choi J, Richards DF and Kale LV (2020), "Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems", In Proceedings of the IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware.
Abstract: The landscape of high performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, the ability to tolerate communication latencies and maximize utilization of the compute hardware are becoming increasingly important in achieving high performance. Overdecomposition has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on application performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems. In this work, we address the challenges in achieving computation-communication overlap with overdecomposition on GPU systems using the Charm++ parallel programming system. By prioritizing communication with CUDA streams in the application and supporting asynchronous progress of GPU operations in the Charm++ runtime system, we obtain improvements in overall performance of up to 50% and 47% with proxy applications Jacobi3D and MiniMD, respectively.
BibTeX:
@inproceedings{Choi2020a,
  author = {Jaemin Choi and David F. Richards and Laxmikant V. Kale},
  title = {Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems},
  booktitle = {Proceedings of the IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware},
  year = {2020},
  url = {https://conferences.computer.org/scwpub/pdfs/ESPM22020-3LkQkprM0X1ItzzieZeAlw/107400a001/107400a001.pdf}
}
Chou S, Kjolstad F and Amarasinghe S (2020), "Automatic Generation of Efficient Sparse Tensor Format Conversion Routines", January, 2020.
Abstract: This paper shows how to generate code that efficiently converts sparse tensors between disparate storage formats (data layouts) like CSR, DIA, ELL, and many others. We decompose sparse tensor conversion into three logical phases: coordinate remapping, analysis, and assembly. We then develop a language that precisely describes how different formats group together and order a tensor's nonzeros in memory. This enables a compiler to emit code that performs complex reorderings (remappings) of nonzeros when converting between formats. We additionally develop a query language that can extract complex statistics about sparse tensors, and we show how to emit efficient analysis code that computes such queries. Finally, we define an abstract interface that captures how data structures for storing a tensor can be efficiently assembled given specific statistics about the tensor. Disparate formats can implement this common interface, thus letting a compiler emit optimized sparse tensor conversion code for arbitrary combinations of a wide range of formats without hard-coding for any specific one. Our evaluation shows that our technique generates sparse tensor conversion routines with performance between 0.99 and 2.2× that of hand-optimized implementations in two widely used sparse linear algebra libraries, SPARSKIT and Intel MKL. By emitting code that avoids materializing temporaries, our technique also outperforms both libraries by between 1.4 and 3.4× for CSC/COO to DIA/ELL conversion.
BibTeX:
@article{Chou2020,
  author = {Stephen Chou and Fredrik Kjolstad and Saman Amarasinghe},
  title = {Automatic Generation of Efficient Sparse Tensor Format Conversion Routines},
  year = {2020}
}
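To make concrete what such a conversion routine does, here is a hand-written MATLAB sketch that converts a small COO matrix to ELL (every row padded to the same number of slots); the generated routines in the paper handle the remapping, analysis, and assembly phases automatically for many more formats.
rng('default');
N = 8;
A = sprand(N, N, 0.3);
[r, c, v] = find(A);                        % COO triplets
w = max(accumarray(r, 1, [N 1]));           % ELL width = longest row
ell_cols = ones(N, w);                      % padded column indices (point at column 1)
ell_vals = zeros(N, w);                     % padded values (explicit zeros)
slot = zeros(N, 1);
for k = 1:numel(v)                          % scatter each nonzero into its row's next free slot
    i = r(k);
    slot(i) = slot(i) + 1;
    ell_cols(i, slot(i)) = c(k);
    ell_vals(i, slot(i)) = v(k);
end
x = randn(N, 1);
norm(sum(ell_vals .* x(ell_cols), 2) - A * x)   % SpMV from the ELL arrays checks the conversion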
Chowdhury A, London P, Avron H and Drineas P (2020), "Speeding up Linear Programming using Randomized Linear Algebra", March, 2020.
Abstract: Linear programming (LP) is an extremely useful tool and has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, or even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as l_1-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods to solve LPs both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. In this paper, we consider infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with the Conjugate Gradient iterative solver, provably guarantees that infeasible IPM algorithms (suitably modified to account for the error incurred by the approximate solver), converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data.
BibTeX:
@article{Chowdhury2020,
  author = {Agniva Chowdhury and Palma London and Haim Avron and Petros Drineas},
  title = {Speeding up Linear Programming using Randomized Linear Algebra},
  year = {2020}
}
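The flavor of the approach can be illustrated in a few lines of MATLAB: sketch the scaled constraint matrix with a Gaussian map, factor the much smaller sketched Gram matrix, and use it as a preconditioner for CG on the IPM normal equations. The sizes, sketch dimension, and diagonal scaling below are all invented, and plain PCG stands in for the modified infeasible IPM analyzed in the paper.
rng('default');
m = 100; n = 5000;                               % many more variables than constraints
A = randn(m, n) / sqrt(n);
d = rand(n, 1) + 0.1;                            % stand-in for the IPM diagonal scaling
rhs = randn(m, 1);
G = @(y) A * (d.^2 .* (A' * y));                 % normal-equations operator A*D^2*A'
S = (A .* d') * randn(n, 4 * m) / sqrt(4 * m);   % Gaussian sketch of A*D
[~, R] = qr(S', 0);                              % S*S' = R'*R approximates A*D^2*A'
M = @(y) R \ (R' \ y);                           % preconditioner solve
[dy, flag] = pcg(G, rhs, 1e-10, 500, M);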
Christlieb AJ, Guthrey PT, Sands WA and Thavappiragasm M (2020), "Parallel Algorithms for Successive Convolution", July, 2020.
Abstract: In this work, we consider alternative discretizations for PDEs which use expansions involving integral operators to approximate spatial derivatives. These constructions use explicit information within the integral terms, but treat boundary data implicitly, which contributes to the overall speed of the method. This approach is provably unconditionally stable for linear problems and stability has been demonstrated experimentally for nonlinear problems. Additionally, it is matrix-free in the sense that it is not necessary to invert linear systems and iteration is not required for nonlinear terms. Moreover, the scheme employs a fast summation algorithm that yields a method with a computational complexity of 𝒪(N), where N is the number of mesh points along a direction. While much work has been done to explore the theory behind these methods, their practicality in large scale computing environments is a largely unexplored topic. In this work, we explore the performance of these methods by developing a domain decomposition algorithm suitable for distributed memory systems along with shared memory algorithms. As a first pass, we derive an artificial CFL condition that enforces a nearest-neighbor communication pattern and briefly discuss possible generalizations. We also analyze several approaches for implementing the parallel algorithms by optimizing predominant loop structures and maximizing data reuse. Using a hybrid design that employs MPI and Kokkos for the distributed and shared memory components of the algorithms, respectively, we show that our methods are efficient and can sustain an update rate > 1×10^8 DOF/node/s. We provide results that demonstrate the scalability and versatility of our algorithms using several different PDE test problems, including a nonlinear example, which employs an adaptive time-stepping rule.
BibTeX:
@article{Christlieb2020,
  author = {Andrew J. Christlieb and Pierson T. Guthrey and William A. Sands and Mathialakan Thavappiragasm},
  title = {Parallel Algorithms for Successive Convolution},
  year = {2020}
}
Cojean T, Tsai Y-H"M and Anzt H (2020), "Ginkgo -- A Math Library designed for Platform Portability", November, 2020.
Abstract: The first associations to software sustainability might be the existence of a continuous integration (CI) framework; the existence of a testing framework composed of unit tests, integration tests, and end-to-end tests; and also the existence of software documentation. However, when asking what is a common deathblow for a scientific software product, it is often the lack of platform and performance portability. Against this background, we designed the Ginkgo library with the primary focus on platform portability and the ability to not only port to new hardware architectures, but also achieve good performance. In this paper we present the Ginkgo library design, radically separating algorithms from hardware-specific kernels forming the distinct hardware executors, and report our experience when adding execution backends for NVIDIA, AMD, and Intel GPUs. We also comment on the different levels of performance portability, and the performance we achieved on the distinct hardware backends.
BibTeX:
@article{Cojean2020,
  author = {Terry Cojean and Yu-Hsiang "Mike" Tsai and Hartwig Anzt},
  title = {Ginkgo -- A Math Library designed for Platform Portability},
  year = {2020}
}
Connolly MP, Higham NJ and Mary T (2020), "Stochastic Rounding and its Probabilistic Backward Error Analysis"
Abstract: Stochastic rounding rounds a real number to the next larger or smaller floating-point number with probabilities 1 minus the relative distances to those numbers. It is gaining attention in deep learning because it can improve the accuracy of the computations. We compare basic properties of stochastic rounding with those for round to nearest, finding properties in common as well as significant differences. We prove that for stochastic rounding the rounding errors are mean independent random variables with zero mean. We derive a new version of our probabilistic error analysis theorem from [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], weakening the assumption of independence of the random variables to mean independence. These results imply that for a wide range of linear algebra computations the backward error for stochastic rounding is unconditionally bounded by a multiple of √n u to first order, with a certain probability, where n is the problem size and u is the unit roundoff. This is the first scenario where the rule of thumb that one can replace nu by √n u in a rounding error bound has been shown to hold without any additional assumptions on the rounding errors. We also explain how stochastic rounding avoids the phenomenon of stagnation in sums, whereby small addends are obliterated by round to nearest when they are too small relative to the sum.
BibTeX:
@article{Connolly2020,
  author = {Connolly, Michael P. and Higham, Nicholas J. and Mary, Theo},
  title = {Stochastic Rounding and its Probabilistic Backward Error Analysis},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2763/1/paper.pdf}
}
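Stochastic rounding is simple to emulate on an artificial coarse grid, which is enough to see the stagnation effect the paper analyzes; the grid spacing, addend, and iteration count in the MATLAB sketch below are arbitrary, and a real low-precision format would behave analogously.
rng('default');
h = 2^-10;                                   % spacing of the artificial rounding grid
sr  = @(x) h * (floor(x / h) + (rand(size(x)) < (x / h - floor(x / h))));   % stochastic rounding
rtn = @(x) h * round(x / h);                 % round to nearest
addend = h / 4;                              % smaller than half the grid spacing
s_rtn = 1; s_sr = 1;
for k = 1:4000
    s_rtn = rtn(s_rtn + addend);             % stagnates: the small addend is always lost
    s_sr  = sr(s_sr + addend);               % unbiased: drifts towards the true sum
end
[s_rtn, s_sr, 1 + 4000 * addend]             % exact sum is about 1.9766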
Constantinides GA (2020), "Rethinking arithmetic for deep neural networks", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190051. The Royal Society.
Abstract: We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understanding of practical neural networks as discrete functions, and show that the so-called binarized neural networks are functionally complete. In general, our results suggest that it is valuable to consider Boolean circuits as neural networks, leading to the question of which circuit topologies are promising. We argue that continuity is central to generalization in learning, explore the interaction between data coding, network topology, and node functionality for continuity and pose some open questions for future research. As a first step to bridging the gap between continuous and Boolean views of neural network accelerators, we present some recent results from our work on LUTNet, a novel Field-Programmable Gate Array inference approach. Finally, we conclude with additional possible fruitful avenues for research bridging the continuous and discrete views of neural networks. This article is part of a discussion meeting issue "Numerical algorithms for high-performance computational science".
BibTeX:
@article{Constantinides2020,
  author = {G. A. Constantinides},
  title = {Rethinking arithmetic for deep neural networks},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190051},
  doi = {10.1098/rsta.2019.0051}
}
Cornelis J and Vanroose W (2020), "Projected Newton method for noise constrained ℓ_p regularization", May, 2020.
Abstract: Choosing an appropriate regularization term is necessary to obtain a meaningful solution to an ill-posed linear inverse problem contaminated with measurement errors or noise. A regularization term in the ℓ_p norm with p ≥ 1 covers a wide range of choices, since its behavior critically depends on the choice of p and since it can easily be combined with a suitable regularization matrix. We develop an efficient algorithm that simultaneously determines the regularization parameter and the corresponding ℓ_p regularized solution such that the discrepancy principle is satisfied. We project the problem on a low-dimensional Generalized Krylov subspace and compute the Newton direction for this much smaller problem. We illustrate some interesting properties of the algorithm and compare its performance with other state-of-the-art approaches using a number of numerical experiments, with a special focus on the sparsity-inducing ℓ_1 norm and edge-preserving total variation regularization.
BibTeX:
@article{Cornelis2020,
  author = {Jeffrey Cornelis and Wim Vanroose},
  title = {Projected Newton method for noise constrained ℓ_p regularization},
  year = {2020}
}
Cortiella A, Park K-C and Doostan A (2020), "Sparse Identification of Nonlinear Dynamical Systems via Reweighted ℓ_1-regularized Least Squares", May, 2020.
Abstract: This work proposes an iterative sparse-regularized regression method to recover governing equations of nonlinear dynamical systems from noisy state measurements. The method is inspired by the Sparse Identification of Nonlinear Dynamics (SINDy) approach of [Brunton et al., PNAS, 113 (15) (2016) 3932-3937], which relies on two main assumptions: the state variables are known a priori and the governing equations lend themselves to sparse, linear expansions in a (nonlinear) basis of the state variables. The aim of this work is to improve the accuracy and robustness of SINDy in the presence of state measurement noise. To this end, a reweighted ℓ_1-regularized least squares solver is developed, wherein the regularization parameter is selected from the corner point of a Pareto curve. The idea behind using the weighted ℓ_1-norm for regularization -- instead of the standard ℓ_1-norm -- is to better promote sparsity in the recovery of the governing equations and, in turn, mitigate the effect of noise in the state variables. We also present a method to recover single physical constraints from state measurements. Through several examples of well-known nonlinear dynamical systems, we demonstrate empirically the accuracy and robustness of the reweighted ℓ_1-regularized least squares strategy with respect to state measurement noise, thus illustrating its viability for a wide range of potential applications.
BibTeX:
@article{Cortiella2020,
  author = {Alexandre Cortiella and Kwang-Chun Park and Alireza Doostan},
  title = {Sparse Identification of Nonlinear Dynamical Systems via Reweighted ℓ_1-regularized Least Squares},
  year = {2020}
}
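A toy version of the recovery problem this paper robustifies can be assembled quickly in MATLAB: simulate a system, build a library of candidate monomials, and sparsify the regression. Sequential thresholded least squares is used below as a simple stand-in for the paper's reweighted ℓ_1 solver, and the dynamical system, library, and threshold are all invented for illustration.
rng('default');
f = @(t, z) [-0.5 * z(1) + 2 * z(2); -2 * z(1) - 0.5 * z(2)];   % true dynamics
[t, Z] = ode45(f, 0:0.01:10, [1; 0]);
dZ = zeros(size(Z));
for k = 1:numel(t)
    dZ(k, :) = f(t(k), Z(k, :)')';             % exact derivatives; the paper works from noisy data
end
x = Z(:, 1); y = Z(:, 2);
Theta = [ones(size(x)), x, y, x.^2, x.*y, y.^2];   % candidate library
Xi = Theta \ dZ;                                   % initial least squares fit
for it = 1:10                                      % sequential thresholding
    Xi(abs(Xi) < 0.1) = 0;
    for j = 1:2
        idx = Xi(:, j) ~= 0;
        Xi(idx, j) = Theta(:, idx) \ dZ(:, j);
    end
end
disp(Xi)   % nonzeros should sit in the x and y rows, close to the true coefficients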
Crane R and Roosta F (2020), "DINO: Distributed Newton-Type Optimization Method", June, 2020.
Abstract: We present a novel communication-efficient Newton-type algorithm for finite-sum optimization over a distributed computing environment. Our method, named DINO, overcomes both theoretical and practical shortcomings of similar existing methods. Under minimal assumptions, we guarantee global sub-linear convergence of DINO to a first-order stationary point for general non-convex functions and arbitrary data distribution over the network. Furthermore, for functions satisfying Polyak-Lojasiewicz (PL) inequality, we show that DINO enjoys a linear convergence rate. Our proposed algorithm is practically parameter free, in that it will converge regardless of the selected hyper-parameters, which are easy to tune. Additionally, its sub-problems are simple linear least-squares, for which efficient solvers exist. Numerical simulations demonstrate the efficiency of DINO as compared with similar alternatives.
BibTeX:
@article{Crane2020,
  author = {Rixon Crane and Fred Roosta},
  title = {DINO: Distributed Newton-Type Optimization Method},
  year = {2020}
}
Criscitiello C and Boumal N (2020), "An accelerated first-order method for non-convex optimization on manifolds", August, 2020.
Abstract: We describe the first gradient methods on Riemannian manifolds to achieve accelerated rates in the non-convex case. Under Lipschitz assumptions on the Riemannian gradient and Hessian of the cost function, these methods find approximate first-order critical points strictly faster than regular gradient descent. A randomized version also finds approximate second-order critical points. Both the algorithms and their analyses build extensively on existing work in the Euclidean case. The basic operation consists in running the Euclidean accelerated gradient descent method (appropriately safe-guarded against non-convexity) in the current tangent space, then moving back to the manifold and repeating. This requires lifting the cost function from the manifold to the tangent space, which can be done for example through the Riemannian exponential map. For this approach to succeed, the lifted cost function (called the pullback) must retain certain Lipschitz properties. As a contribution of independent interest, we prove precise claims to that effect, with explicit constants. Those claims are affected by the Riemannian curvature of the manifold, which in turn affects the worst-case complexity bounds for our optimization algorithms.
BibTeX:
@article{Criscitiello2020,
  author = {Chris Criscitiello and Nicolas Boumal},
  title = {An accelerated first-order method for non-convex optimization on manifolds},
  year = {2020}
}
Croci M and Giles MB (2020), "Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision", October, 2020.
Abstract: Motivated by the advent of machine learning, the last few years saw the return of hardware-supported low-precision computing. Computations with fewer digits are faster and more memory and energy efficient, but can be extremely susceptible to rounding errors. An application that can largely benefit from the advantages of low-precision computing is the numerical solution of partial differential equations (PDEs), but a careful implementation and rounding error analysis are required to ensure that sensible results can still be obtained. In this paper we study the accumulation of rounding errors in the solution of the heat equation, a proxy for parabolic PDEs, via Runge-Kutta finite difference methods using round-to-nearest (RtN) and stochastic rounding (SR). We demonstrate how to implement the scheme to reduce rounding errors and we derive a priori estimates for local and global rounding errors. Let u be the roundoff unit. While the worst-case local errors are O(u) with respect to the discretization parameters, the RtN and SR error behavior is substantially different. We prove that the RtN solution is discretization, initial condition and precision dependent, and always stagnates for small enough Δt. Until stagnation, the global error grows like O(uΔt^-1). In contrast, we show that the leading order errors introduced by SR are zero-mean, independent in space and mean-independent in time, making SR resilient to stagnation and rounding error accumulation. In fact, we prove that for SR the global rounding errors are only O(uΔt^-1/4) in 1D and are essentially bounded (up to logarithmic factors) in higher dimensions.
BibTeX:
@article{Croci2020,
  author = {Matteo Croci and Michael Bryce Giles},
  title = {Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision},
  year = {2020}
}
Cummins C, Fisches ZV, Ben-Nun T, Hoefler T and Leather H (2020), "ProGraML: Graph-based Deep Learning for Program Optimization and Analysis", March, 2020.
Abstract: The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation. We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks. ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.
BibTeX:
@article{Cummins2020,
  author = {Chris Cummins and Zacharias V. Fisches and Tal Ben-Nun and Torsten Hoefler and Hugh Leather},
  title = {ProGraML: Graph-based Deep Learning for Program Optimization and Analysis},
  year = {2020}
}
Curtis FE and Scheinberg K (2020), "Adaptive Stochastic Optimization", January, 2020.
Abstract: Optimization lies at the heart of machine learning and signal processing. Contemporary approaches based on the stochastic gradient method are non-adaptive in the sense that their implementation employs prescribed parameter values that need to be tuned for each application. This article summarizes recent research and motivates future work on adaptive stochastic optimization methods, which have the potential to offer significant computational savings when training large-scale systems.
BibTeX:
@article{Curtis2020,
  author = {Frank E. Curtis and Katya Scheinberg},
  title = {Adaptive Stochastic Optimization},
  year = {2020}
}
Curtis FE, Dai Y and Robinson DP (2020), "A Subspace Acceleration Method for Minimization Involving a Group Sparsity-Inducing Regularizer", July, 2020.
Abstract: We consider the problem of minimizing an objective function that is the sum of a convex function and a group sparsity-inducing regularizer. Problems that integrate such regularizers arise in modern machine learning applications, often for the purpose of obtaining models that are easier to interpret and that have higher predictive accuracy. We present a new method for solving such problems that utilizes subspace acceleration, domain decomposition, and support identification. Our analysis shows, under common assumptions, that the iterate sequence generated by our framework is globally convergent, converges to an 𝜖-approximate solution in at most O(𝜖^-(1+p)) (respectively, O(𝜖^-(2+p))) iterations for all 𝜖 bounded above and large enough (respectively, all 𝜖 bounded above) where p > 0 is an algorithm parameter, and exhibits superlinear local convergence. Preliminary numerical results for the task of binary classification based on regularized logistic regression show that our approach is efficient and robust, with the ability to outperform a state-of-the-art method.
BibTeX:
@article{Curtis2020a,
  author = {Frank E. Curtis and Yutong Dai and Daniel P. Robinson},
  title = {A Subspace Acceleration Method for Minimization Involving a Group Sparsity-Inducing Regularizer},
  year = {2020}
}
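As context for the Curtis, Dai and Robinson paper, the basic operation that group sparsity-inducing regularizers contribute is a block soft-thresholding proximal step, which can zero out entire groups of variables at once. The MATLAB sketch below shows only that step on made-up data; the paper's method layers subspace acceleration and support identification on top of computations like this.
lambda = 0.5;
x = randn(6, 1);
groups = {1:2, 3:4, 5:6};                      % disjoint groups of coordinates
for g = 1:numel(groups)
    idx = groups{g};
    shrink = max(0, 1 - lambda/norm(x(idx)));  % 0 whenever the group norm <= lambda
    x(idx) = shrink * x(idx);                  % the whole group shrinks or vanishes
end
disp(x.');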
D'Ambra P, Durastante F and Filippone S (2020), "AMG preconditioners for Linear Solvers towards Extreme Scale", June, 2020.
Abstract: Linear solvers for large and sparse systems are a key element of scientific applications, and their efficient implementation is necessary to harness the computational power of current computers. Algebraic Multigrid (AMG) Preconditioners are a popular ingredient of such linear solvers; this is the motivation for the present work where we examine some recent developments in a package of AMG preconditioners to improve efficiency, scalability, and robustness on extreme-scale problems. The main novelty is the design and implementation of a new parallel coarsening algorithm based on aggregation of unknowns employing weighted graph matching techniques; this is a completely automated procedure, requiring no information from the user, and applicable to general symmetric positive definite (s.p.d.) matrices. The new coarsening algorithm improves in terms of numerical scalability at low operator complexity over decoupled aggregation algorithms available in previous releases of the package. The preconditioners package is built on the parallel software framework PSBLAS, which has also been updated to progress towards exascale. We present weak scalability results on two of the most powerful supercomputers in Europe, for linear systems with sizes up to O(10^10) unknowns.
BibTeX:
@article{DAmbra2020,
  author = {Pasqua D'Ambra and Fabio Durastante and Salvatore Filippone},
  title = {AMG preconditioners for Linear Solvers towards Extreme Scale},
  year = {2020}
}
Danilova M, Dvurechensky P, Gasnikov A, Gorbunov E, Guminov S, Kamzolov D and Shibaev I (2020), "Recent Theoretical Advances in Non-Convex Optimization", December, 2020.
Abstract: Motivated by recent increased interest in optimization algorithms for non-convex optimization in application to training deep neural networks and other optimization problems in data analysis, we give an overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization. We start with classical arguments showing that general non-convex problems could not be solved efficiently in a reasonable time. Then we give a list of problems that can be solved efficiently to find the global minimizer by exploiting the structure of the problem as much as it is possible. Another way to deal with non-convexity is to relax the goal from finding the global minimum to finding a stationary point or a local minimum. For this setting, we first present known results for the convergence rates of deterministic first-order methods, which are then followed by a general theoretical analysis of optimal stochastic and randomized gradient schemes, and an overview of the stochastic first-order methods. After that, we discuss quite general classes of non-convex problems, such as minimization of α-weakly-quasi-convex functions and functions that satisfy Polyak--Lojasiewicz condition, which still allow obtaining theoretical convergence guarantees of first-order methods. Then we consider higher-order and zeroth-order/derivative-free methods and their convergence rates for non-convex optimization problems.
BibTeX:
@article{Danilova2020,
  author = {Marina Danilova and Pavel Dvurechensky and Alexander Gasnikov and Eduard Gorbunov and Sergey Guminov and Dmitry Kamzolov and Innokentiy Shibaev},
  title = {Recent Theoretical Advances in Non-Convex Optimization},
  year = {2020}
}
Das A, Briggs I, Gopalakrishnan G and Krishnamoorthy S (2020), "An Abstraction-guided Approach to Scalable and Rigorous Floating-Point Error Analysis", April, 2020.
Abstract: Automated techniques for rigorous floating-point round-off error analysis are important in areas including formal verification of correctness and precision tuning. Existing tools and techniques, while providing tight bounds, fail to analyze expressions with more than a few hundred operators, thus unable to cover important practical problems. In this work, we present Satire, a new tool that sheds light on how scalability and bound-tightness can be attained through a combination of incremental analysis, abstraction, and judicious use of concrete and symbolic evaluation. Satire has handled problems exceeding 200K operators. We present Satire's underlying error analysis approach, information-theoretic abstraction heuristics, and a wide range of case studies, with evaluation covering FFT, Lorenz system of equations, and various PDE stencil types. Our results demonstrate the tightness of Satire's bounds, its acceptable runtime, and valuable insights provided.
BibTeX:
@article{Das2020,
  author = {Arnab Das and Ian Briggs and Ganesh Gopalakrishnan and Sriram Krishnamoorthy},
  title = {An Abstraction-guided Approach to Scalable and Rigorous Floating-Point Error Analysis},
  year = {2020}
}
Das A, Briggs I, Gopalakrishnan G, Krishnamoorthy S and Panchekha P (2020), "Scalable yet Rigorous Floating-Point Error Analysis", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 708-721. IEEE Computer Society.
Abstract: Automated techniques for rigorous floating-point round-off error analysis are a prerequisite to placing important activities in HPC such as precision allocation, verification, and code optimization on a formal footing. Yet existing techniques cannot provide tight bounds for expressions beyond a few dozen operators -- barely enough for HPC. In this work, we offer an approach embedded in a new tool called SATIRE that scales error analysis by four orders of magnitude compared to today's best-of-class tools. We explain how three key ideas underlying SATIRE help it attain such scale: path strength reduction, bound optimization, and abstraction. SATIRE provides tight bounds and rigorous guarantees on significantly larger expressions with well over a hundred thousand operators, covering important examples including FFT, matrix multiplication, and PDE stencils.
BibTeX:
@inproceedings{Das2020a,
  author = {A. Das and I. Briggs and G. Gopalakrishnan and S. Krishnamoorthy and P. Panchekha},
  title = {Scalable yet Rigorous Floating-Point Error Analysis},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {708--721},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00055},
  doi = {10.1109/SC41405.2020.00055}
}
Davis TA, Hager WW, Kolodziej SP and Yeralan SN (2020), "Algorithm 1003: Mongoose, A Graph Coarsening and Partitioning Library", ACM Transactions on Mathematical Software. Vol. 46(7)
Abstract: Partitioning graphs is a common and useful operation in many areas, from parallel computing to VLSI design to sparse matrix algorithms. In this paper, we introduce Mongoose, a multilevel hybrid graph partitioning algorithm and library. Building on previous work in multilevel partitioning frameworks and combinatoric approaches, we introduce novel stall-reducing and stall-free coarsening strategies, as well as an efficient hybrid algorithm leveraging 1) traditional combinatoric methods and 2) continuous quadratic programming formulations. We demonstrate how this new hybrid algorithm outperforms either strategy in isolation, and we also compare Mongoose to METIS and demonstrate its effectiveness on large and social networking (power law) graphs.
BibTeX:
@article{Davis2020,
  author = {Davis, Timothy A. and Hager, William W. and Kolodziej, Scott P. and Yeralan, S. Nuri},
  title = {Algorithm 1003: Mongoose, A Graph Coarsening and Partitioning Library},
  journal = {ACM Transactions on Mathematical Software},
  year = {2020},
  volume = {46},
  number = {7},
  doi = {10.1145/3387915}
}
Davis JH, Daley C, Pophale S, Huber T, Chandrasekaran S and Wright NJ (2020), "Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs", October, 2020.
Abstract: Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today's systems to tomorrow's. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges on emerging systems. This work focuses on applying and demonstrating OpenMP offloading directives on five proxy applications. We observe that the performance varies widely from one compiler to the other; a crucial aspect of our work is reporting best practices to application developers who use OpenMP offloading compilers. While some issues can be worked around by the developer, there are other issues that must be reported to the compiler vendors. By restructuring OpenMP offloading directives, we gain an 18x speedup for the su3 proxy application on NERSC's Cori system when using the Clang compiler, and a 15.7x speedup by switching max reductions to add reductions in the laplace mini-app when using the Cray-llvm compiler on Cori.
BibTeX:
@article{Davis2020a,
  author = {Joshua Hoke Davis and Christopher Daley and Swaroop Pophale and Thomas Huber and Sunita Chandrasekaran and Nicholas J. Wright},
  title = {Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs},
  year = {2020}
}
DeFreez D, Bhowmick A, Laguna I and Rubio-González C (2020), "Detecting and reproducing error-code propagation bugs in MPI implementations", In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., 2, 2020. ACM.
Abstract: We present an approach to automatically detect and reproduce error code propagation bugs in MPI implementations. Specifically, we combine static analysis and program repair for bug detection, and apply fault injection to reproduce error propagation bugs found in MPI libraries written in C. We demonstrate our approach on the MPICH library, one of the most popular implementations of MPI, and the MPICH-based implementation MVAPICH, uncovering 447 previously unknown bugs. We discovered that 31 of these bugs result in program crashes, and 60% of the MPICH test suite is susceptible to crashing due to failures to propagate error codes. Moreover, 95 bugs produce undesirable behavior that has been confirmed dynamically, causing tests to fail, hanging processes, or simply dropping error codes before reaching user applications.
BibTeX:
@inproceedings{DeFreez2020,
  author = {Daniel DeFreez and Antara Bhowmick and Ignacio Laguna and Cindy Rubio-González},
  title = {Detecting and reproducing error-code propagation bugs in MPI implementations},
  booktitle = {Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3332466.3374515}
}
Demidov D (2020), "AMGCL — A C++ library for efficient solution of large sparse linear systems", Software Impacts., 11, 2020. Vol. 6, pp. 100037. Elsevier BV.
Abstract: AMGCL is a header-only C++ library for the solution of large sparse linear systems with algebraic multigrid. The method may be used as a black-box solver for computational problems in various fields, since it does not require any information about the underlying geometry. AMGCL provides an efficient, flexible, and extensible implementation of several iterative solvers and preconditioners on top of different backends allowing the acceleration of the solution with the help of OpenMP, OpenCL, or CUDA technologies. Most algorithms have both shared memory and distributed memory implementations. The library is published under a permissive MIT license.
BibTeX:
@article{Demidov2020,
  author = {Denis Demidov},
  title = {AMGCL — A C++ library for efficient solution of large sparse linear systems},
  journal = {Software Impacts},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {6},
  pages = {100037},
  doi = {10.1016/j.simpa.2020.100037}
}
Demirci GV and Aykanat C (2020), "Cartesian Partitioning Models for 2D and 3D Parallel SpGEMM Algorithms", IEEE Transactions on Parallel and Distributed Systems., 12, 2020. Vol. 31(12), pp. 2763-2775. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: The focus is distributed-memory parallelization of sparse-general-matrix-multiplication (SpGEMM). Parallel SpGEMM algorithms are classified under one-dimensional (1D), 2D, and 3D categories denoting the number of dimensions by which the 3D sparse workcube representing the iteration space of SpGEMM is partitioned. Recently proposed successful 2D- and 3D-parallel SpGEMM algorithms benefit from upper bounds on communication overheads enforced by 2D and 3D cartesian partitioning of the workcube on 2D and 3D virtual processor grids, respectively. However, these methods are based on random cartesian partitioning and do not utilize sparsity patterns of SpGEMM instances for reducing the communication overheads. We propose hypergraph models for 2D and 3D cartesian partitioning of the workcube for further reducing the communication overheads of these 2D- and 3D-parallel SpGEMM algorithms. The proposed models utilize two- and three-phase partitioning that exploit multi-constraint hypergraph partitioning formulations. Extensive experimentation performed on 20 SpGEMM instances by using up to 900 processors demonstrates that the proposed partitioning models significantly improve the scalability of 2D and 3D algorithms. For example, in the 2D-parallel SpGEMM algorithm on 900 processors, the proposed partitioning model respectively achieves 85 and 42 percent decrease in total volume and total number of messages, leading to 1.63 times higher speedup compared to random partitioning, on average.
BibTeX:
@article{Demirci2020,
  author = {Gunduz Vehbi Demirci and Cevdet Aykanat},
  title = {Cartesian Partitioning Models for 2D and 3D Parallel SpGEMM Algorithms},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  volume = {31},
  number = {12},
  pages = {2763--2775},
  doi = {10.1109/tpds.2020.3000708}
}
Demmel J, Dongarra J, Langou J, Langou J, Luszczek P and Mahoney MW (2020), "Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC)". Thesis at: University of Tennessee.
Abstract: The convergence of several unprecedented changes, including formidable new system design constraints and revolutionary levels of heterogeneity, has made it clear that much of the essential software infrastructure of computational science and engineering is, or will soon be, obsolete. Math libraries have historically been in the vanguard of software that must be adapted first to such changes, both because these low-level workhorses are so critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Under the Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC) project, the principal designers of the Linear Algebra PACKage (LAPACK) and the Scalable Linear Algebra PACKage (ScaLAPACK), the combination of which is abbreviated Sca/LAPACK, aim to enhance and update these libraries for the ongoing revolution in processor architecture, system design, and application requirements by incorporating them into a layered package of software components -- the BALLISTIC ecosystem -- that provides users seamless access to state-of-the-art solver implementations through familiar and improved Sca/LAPACK interfaces. The set of innovations and improvements that will be made available through BALLISTIC is the result of a combination of inputs from a variety of sources: the authors' own algorithmic and software research, which attacks the challenges of multi-core, hybrid, and extreme-scale system designs; extensive interactions with users, vendors, and the management of large high-performance computing (HPC) facilities to help anticipate the demands and opportunities of new architectures and programming languages; and, finally, the enthusiastic participation of the research community in developing and offering enhanced versions of existing dense linear algebra software components. Aiming to help applications run portably at all levels of the platform pyramid, including in cloud-based systems, BALLISTIC's technical agenda includes: (1) adding new functionality requested by stakeholder communities; (2) incorporating vastly improved numerical methods and algorithms; (3) leveraging successful research results to transition Sca/LAPACK (interfaces) to multi-core and accelerator-enabled versions; (4) providing user-controllable autotuning for the deployed software; (5) introducing new interfaces and data structures to increase ease of use; (6) enhancing engineering for evolution via standards and community engagement; and (7) continuing to expand application community outreach. Enhanced engineering will also help keep the reference implementation for Sca/LAPACK efficient, maintainable, and testable at reasonable cost in the future. The Sca/LAPACK libraries are the community standard for dense linear algebra. They have been adopted and/or supported by a large community of users, computing centers, and HPC vendors. Learning to use them is a basic part of the education of a computational scientist or engineer in many fields and at many academic institutions. No other numerical library can claim this breadth of integration with the community. Consequently, enhancing these libraries with state-of-the-art methods and algorithms and adapting them for new and emerging platforms (reaching up to extreme scale and including cloud-based environments) is set to have a correspondingly large impact on the research and education community, government laboratories, and private industry.
BibTeX:
@techreport{Demmel2020,
  author = {James Demmel and Jack Dongarra and Julie Langou and Julien Langou and Piotr Luszczek and Michael W. Mahoney},
  title = {Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC)},
  school = {University of Tennessee},
  year = {2020},
  url = {https://www.icl.utk.edu/files/publications/2020/icl-utk-1391-2020.pdf}
}
Deng X, Sun T, Du P and Li D (2020), "A Nonconvex Implementation of Sparse Subspace Clustering: Algorithm and Convergence Analysis", IEEE Access. Vol. 8, pp. 54741-54750. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Subspace clustering has been widely applied to detect meaningful clusters in high-dimensional data spaces. The sparse subspace clustering (SSC) method obtains superior clustering performance by solving a relaxed ℓ_0-minimization problem with the ℓ_1-norm. Although the use of the ℓ_1-norm instead of the ℓ_0 one can make the objective function convex, it causes large errors on large coefficients in some cases. In this paper, we study the sparse subspace clustering algorithm based on a nonconvex modeling formulation. Specifically, we introduce a nonconvex pseudo-norm that makes a better approximation to the ℓ_0-minimization than the traditional ℓ_1-minimization framework and consequently finds a better affinity matrix. However, this formulation makes the optimization task challenging because the traditional alternating direction method of multipliers (ADMM) encounters trouble in solving the nonconvex subproblems. In view of this, reweighting techniques are employed to make these subproblems convex and easily solvable. We provide several guarantees to derive the convergence results, which prove that the nonconvex algorithm is globally convergent to a critical point. Experiments on two real-world problems of motion segmentation and face clustering show that our method outperforms state-of-the-art techniques.
BibTeX:
@article{Deng2020,
  author = {Xiaoge Deng and Tao Sun and Peibing Du and Dongsheng Li},
  title = {A Nonconvex Implementation of Sparse Subspace Clustering: Algorithm and Convergence Analysis},
  journal = {IEEE Access},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  volume = {8},
  pages = {54741--54750},
  doi = {10.1109/access.2020.2981740}
}
Dereziński M and Mahoney MW (2020), "Determinantal Point Processes in Randomized Numerical Linear Algebra", May, 2020.
Abstract: Randomized Numerical Linear Algebra (RandNLA) uses randomness to develop improved algorithms for matrix problems that arise in scientific computing, data science, machine learning, etc. Determinantal Point Processes (DPPs), a seemingly unrelated topic in pure and applied mathematics, is a class of stochastic point processes with probability distribution characterized by sub-determinants of a kernel matrix. Recent work has uncovered deep and fruitful connections between DPPs and RandNLA which lead to new guarantees and improved algorithms that are of interest to both areas. We provide an overview of this exciting new line of research, including brief introductions to RandNLA and DPPs, as well as applications of DPPs to classical linear algebra tasks such as least squares regression, low-rank approximation and the Nyström method. For example, random sampling with a DPP leads to new kinds of unbiased estimators for least squares, enabling more refined statistical and inferential understanding of these algorithms; a DPP is, in some sense, an optimal randomized algorithm for the Nyström method; and a RandNLA technique called leverage score sampling can be derived as the marginal distribution of a DPP. We also discuss recent algorithmic developments, illustrating that, while not quite as efficient as standard RandNLA techniques, DPP-based algorithms are only moderately more expensive.
BibTeX:
@article{Derezinski2020,
  author = {Michał Dereziński and Michael W. Mahoney},
  title = {Determinantal Point Processes in Randomized Numerical Linear Algebra},
  year = {2020}
}
Devarakonda A and Demmel J (2020), "Avoiding Communication in Logistic Regression", November, 2020.
Abstract: Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems. SGD solves an optimization problem by iteratively sampling a few data points from the input data, computing gradients for the selected data points, and updating the solution. However, in a parallel setting, SGD requires interprocess communication at every iteration. We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD. This technique re-organizes the SGD computations into a form that communicates every s iterations instead of every iteration, where s is a tuning parameter. We prove theoretical flops, bandwidth, and latency upper bounds for SGD and its new communication-avoiding variant. Furthermore, we show experimental results that illustrate that the new Communication-Avoiding SGD (CA-SGD) method can achieve speedups of up to 4.97× on a high-performance Infiniband cluster without altering the convergence behavior or accuracy.
BibTeX:
@article{Devarakonda2020,
  author = {Aditya Devarakonda and James Demmel},
  title = {Avoiding Communication in Logistic Regression},
  year = {2020}
}
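For reference, the baseline that Devarakonda and Demmel reorganize is plain mini-batch SGD for regularized logistic regression; their CA-SGD rearranges these updates so that parallel processes communicate only every s iterations. The serial MATLAB sketch below uses synthetic data and made-up hyperparameters and does not include the communication-avoiding reorganization itself.
rng('default');
n = 1000; d = 20;
X = randn(n, d);
y = sign(X * randn(d, 1) + 0.1*randn(n, 1));   % synthetic labels in {-1, +1}
w = zeros(d, 1);
eta = 0.1; lambda = 1e-3; batch = 32;
for it = 1:500
    idx = randi(n, batch, 1);
    m = y(idx) .* (X(idx, :) * w);                                  % margins
    g = -X(idx, :)' * (y(idx) ./ (1 + exp(m))) / batch + lambda*w;  % logistic gradient
    w = w - eta * g;                     % in CA-SGD, s of these steps are grouped per communication
end
fprintf('training accuracy: %.3f\n', mean(sign(X*w) == y));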
Devine K and Ballard G (2020), "GentenMPI: Distributed Memory Sparse Tensor Decomposition". Thesis at: Sandia National Laboratories.
Abstract: GentenMPI is a toolkit of sparse canonical polyadic (CP) tensor decomposition algorithms that is designed to run effectively on distributed-memory high-performance computers. Its use of distributed-memory parallelism enables it to efficiently decompose tensors that are too large for a single compute node's memory. GentenMPI leverages Sandia's decades-long investment in the Trilinos solver framework for much of its parallel-computation capability. Trilinos contains numerical algorithms and linear algebra classes that have been optimized for parallel simulation of complex physical phenomena. This work applies these tools to the data science problem of sparse tensor decomposition. In this report, we describe the use of Trilinos in GentenMPI, extensions needed for sparse tensor decomposition, and implementations of the CP-ALS (CP via alternating least squares [4, 7]) and GCP-SGD (generalized CP via stochastic gradient descent [11, 12, 17]) sparse tensor decomposition algorithms. We show that GentenMPI can decompose sparse tensors of extreme size, e.g., a 12.6-terabyte tensor on 8192 computer cores. We demonstrate that the Trilinos backbone provides good strong and weak scaling of the tensor decomposition algorithms.
BibTeX:
@techreport{Devine2020,
  author = {Karen Devine and Grey Ballard},
  title = {GentenMPI: Distributed Memory Sparse Tensor Decomposition},
  school = {Sandia National Laboratories},
  year = {2020},
  url = {https://www.osti.gov/servlets/purl/1656940}
}
Dinda P, Bernat A and Hetland C (2020), "Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications", In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing.
Abstract: Scientific (and other) applications are critically dependent on calculations done using IEEE floating point arithmetic. A number of concerns have been raised about correctness in such applications given the numerous gotchas the IEEE standard presents for developers, as well as the complexity of its implementation at the hardware and compiler levels. The standard and its implementations do provide mechanisms for analyzing floating point arithmetic as it executes, making it possible to find and track problematic operations. However, this capability is seldom used in practice. In response, we have developed FPSpy, a tool that provides this capability when operating underneath existing, unmodified x64 application binaries on Linux, including those using thread- and process-level parallelism. FPSpy can observe application behavior without any cooperation from the application or developer, and can potentially be deployed as part of a job launch process. We present the design, implementation, and performance evaluation of FPSpy. FPSpy operates conservatively, getting out of the way if the application itself begins to use any of the OS or hardware features that FPSpy depends on. Its overhead can be throttled, allowing a tradeoff between which and how many unusual events are to be captured, and the slowdown incurred by the application, with the low point providing virtually zero slowdown. We evaluated FPSpy by using it to methodically study seven widely-used applications/frameworks from a range of domains (five of which are in the NSF XSEDE top-20), as well as the NAS and PARSEC benchmark suites. All told, these comprise about 7.5 million lines of source code in a wide range of languages, and parallelism models (including OpenMP and MPI). FPSpy was able to produce trace information for all of them. The traces show that problematic floating point events occur in both the applications and the benchmarks. Analysis of the rounding behavior captured in our traces also suggests the feasibility of an approach to adding adaptive precision underneath existing, unmodified binaries.
BibTeX:
@inproceedings{Dinda2020,
  author = {Peter Dinda and Alex Bernat and Conor Hetland},
  title = {Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications},
  booktitle = {Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing},
  year = {2020},
  url = {http://pdinda.org/Papers/hpdc20.pdf}
}
Ding N, Williams S, Liu Y and Li XS (2020), "Leveraging One-Sided Communication for Sparse Triangular Solvers"
Abstract: In this paper, we implement and evaluate a one-sided communication-based distributed-memory sparse triangular solve (SpTRSV). SpTRSV is used in conjunction with sparse LU to effect preconditioning in linear solvers. One-sided communication paradigms enjoy higher effective network bandwidth and lower synchronization costs compared to their two-sided counterparts. We use a passive target mode in one-sided communication to implement a synchronization-free task queue to manage the messaging between producer-consumer pairs. Whereas some numerical methods lend themselves to simple performance analysis, the DAG-based computational graph of SpTRSV demands we construct a critical path performance model in order to assess our observed performance relative to machine capabilities. In alignment with our model, our foMPI-based one-sided implementation of SpTRSV reduces communication time by 1.5× to 2.5× and improves SpTRSV solver performance by up to 2.4× compared to SuperLU_DIST's two-sided MPI implementation running on 64 to 4,096 processes on Cray supercomputers.
BibTeX:
@article{Ding2020,
  author = {Nan Ding and Samuel Williams and Yang Liu and Xiaoye S. Li},
  title = {Leveraging One-Sided Communication for Sparse Triangular Solvers},
  year = {2020}
}
Dinh G and Demmel J (2020), "Communication-Optimal Tilings for Projective Nested Loops with Arbitrary Bounds", February, 2020.
Abstract: Reducing communication - either between levels of a memory hierarchy or between processors over a network - is a key component of performance optimization (in both time and energy) for many problems, including dense linear algebra, particle interactions, and machine learning. For these problems, which can be represented as nested-loop computations, previous tiling based approaches have been used to find both lower bounds on the communication required to execute them and optimal rearrangements, or blockings, to attain such lower bounds. However, such general approaches have typically assumed the problem sizes are large, an assumption that is often not met in practice. For instance, the classical (num arithmetic operations)/(cache size)^1/2 lower bound for matrix multiplication is not tight for matrix-vector multiplications, which must read in at least O(num arithmetic operations) words of memory; similar issues occur for almost all convolutions in machine learning applications, which use extremely small filter sizes (and therefore, loop bounds). In this paper, we provide an efficient way to both find and obtain, via an appropriate, efficiently constructible blocking, communication lower bounds and matching tilings which attain these lower bounds for nested loop programs with arbitrary loop bounds that operate on multidimensional arrays in the projective case, where the array indices are subsets of the loop indices. Our approach works on all such problems, regardless of dimensionality, size, memory access patterns, or number of arrays, and directly applies to (among other examples) matrix multiplication and similar dense linear algebra operations, tensor contractions, n-body pairwise interactions, pointwise convolutions, and fully connected layers.
BibTeX:
@article{Dinh2020,
  author = {Grace Dinh and James Demmel},
  title = {Communication-Optimal Tilings for Projective Nested Loops with Arbitrary Bounds},
  year = {2020}
}
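The canonical example of the projective nested loops Dinh and Demmel analyze is blocked matrix multiplication, where the choice of tiling controls how much data moves between slow and fast memory. The MATLAB sketch below only illustrates what such a blocking looks like; it is not the paper's construction, and loop-level tiling has no performance benefit inside MATLAB itself.
n = 256; b = 32;                             % b plays the role of the tile size
A = randn(n); B = randn(n); C = zeros(n);
for ii = 1:b:n
    for jj = 1:b:n
        for kk = 1:b:n
            I = ii:ii+b-1; J = jj:jj+b-1; K = kk:kk+b-1;
            C(I, J) = C(I, J) + A(I, K) * B(K, J);   % one tile-sized update
        end
    end
end
fprintf('error vs A*B: %.2e\n', norm(C - A*B, 'fro'));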
Dixit R and Bajwa WU (2020), "Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points", June, 2020.
Abstract: This paper considers the problem of understanding the exit time for trajectories of gradient-related first-order methods from saddle neighborhoods under some initial boundary conditions. Given the `flat' geometry around saddle points, first-order methods can struggle in escaping these regions in a fast manner due to the small magnitudes of gradients encountered. In particular, while it is known that gradient-related first-order methods escape strict-saddle neighborhoods, existing literature does not explicitly leverage the local geometry around saddle points in order to control behavior of gradient trajectories. It is in this context that this paper puts forth a rigorous geometric analysis of the gradient-descent method around strict-saddle neighborhoods using matrix perturbation theory. In doing so, it provides a key result that can be used to generate an approximate gradient trajectory for any given initial conditions. In addition, the analysis leads to a linear exit-time solution for gradient-descent method under certain necessary initial conditions for a class of strict-saddle functions.
BibTeX:
@article{Dixit2020,
  author = {Rishabh Dixit and Waheed U. Bajwa},
  title = {Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points},
  year = {2020}
}
Doikov N and Nesterov Y (2020), "Inexact Tensor Methods with Dynamic Accuracies", February, 2020.
Abstract: In this paper, we study inexact high-order Tensor Methods for solving convex optimization problems with composite objective. At every step of such methods, we use an approximate solution of the auxiliary problem, defined by the bound for the residual in function value. We propose two dynamic strategies for choosing the inner accuracy: the first one is decreasing as 1/k^(p+1), where p ≥ 1 is the order of the method and k is the iteration counter, and the second approach is using for the inner accuracy the last progress in the target objective. We show that inexact Tensor Methods with these strategies achieve the same global convergence rate as in the error-free case. For the second approach we also establish local superlinear rates (for p ≥ 2), and propose the accelerated scheme. Lastly, we present computational results on a variety of machine learning problems for several methods and different accuracy policies.
BibTeX:
@article{Doikov2020,
  author = {Nikita Doikov and Yurii Nesterov},
  title = {Inexact Tensor Methods with Dynamic Accuracies},
  year = {2020}
}
Doikov N and Nesterov Y (2020), "Convex optimization based on global lower second-order models", CORE Discussion Papers ; 2020/23 (2020) 22 pages http://hdl.handle.net/2078.1/230370., June, 2020.
Abstract: In this paper, we present new second-order algorithms for composite convex optimization, called Contracting-domain Newton methods. These algorithms are affine-invariant and based on global second-order lower approximation for the smooth component of the objective. Our approach has an interpretation both as a second-order generalization of the conditional gradient method and as a variant of a trust-region scheme. Under the assumption that the problem domain is bounded, we prove an 𝒪(1/k^2) global rate of convergence in functional residual, where k is the iteration counter, minimizing convex functions with Lipschitz continuous Hessian. This significantly improves the previously known bound 𝒪(1/k) for this type of algorithms. Additionally, we propose a stochastic extension of our method, and present computational results for solving the empirical risk minimization problem.
BibTeX:
@article{Doikov2020a,
  author = {Nikita Doikov and Yurii Nesterov},
  title = {Convex optimization based on global lower second-order models},
  journal = {CORE Discussion Papers ; 2020/23 (2020) 22 pages http://hdl.handle.net/2078.1/230370},
  year = {2020}
}
Doikov N and Nesterov Y (2020), "Affine-invariant contracting-point methods for Convex Optimization", September, 2020.
Abstract: In this paper, we develop new affine-invariant algorithms for solving composite convex minimization problems with bounded domain. We present a general framework of Contracting-Point methods, which solve at each iteration an auxiliary subproblem restricting the smooth part of the objective function onto contraction of the initial domain. This framework provides us with a systematic way for developing optimization methods of different order, endowed with the global complexity bounds. We show that using an appropriate affine-invariant smoothness condition, it is possible to implement one iteration of the Contracting-Point method by one step of the pure tensor method of degree p ≥ 1. The resulting global rate of convergence in functional residual is then 𝒪(1/k^p), where k is the iteration counter. It is important that all constants in our bounds are affine-invariant. For p = 1, our scheme recovers the well-known Frank-Wolfe algorithm, providing it with a new interpretation by a general perspective of tensor methods. Finally, within our framework, we present an efficient implementation and total complexity analysis of the inexact second-order scheme (p = 2), called the Contracting Newton method. It can be seen as a proper implementation of the trust-region idea. Preliminary numerical results confirm its good practical performance both in the number of iterations, and in computational time.
BibTeX:
@article{Doikov2020b,
  author = {Nikita Doikov and Yurii Nesterov},
  title = {Affine-invariant contracting-point methods for Convex Optimization},
  year = {2020}
}
Domenech-Asensi G and Kazmierski TJ (2020), "Stability and Efficiency of Explicit Integration in Interconnect Analysis on GPUs", In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems., 10, 2020. IEEE.
Abstract: This paper presents a new high-performance technique to parallelise numeric integration of large VLSI interconnect analog models on a general purpose GPU. The technique is based on the combination of a state-space formulation with an explicit integration method based on the Adams-Bashforth second order formula. The paper studies the stability of the variable step explicit method and proposes a technique to guarantee integration stability specifically for interconnect systems. Although explicit methods require smaller integration steps compared to those of the traditional implicit techniques, they avoid the complex calculations inherent to implicit integration. The proposed approach is demonstrated using an RC VLSI interconnect model and results are compared to those achieved by Ngspice, a state-of-the-art implicit integration solver, implemented on the same parallel hardware. The results show that the speed-up of the parallelised explicit solution reaches one order of magnitude for large systems and increases with the circuit size.
BibTeX:
@inproceedings{DomenechAsensi2020,
  author = {Gines Domenech-Asensi and Tom J. Kazmierski},
  title = {Stability and Efficiency of Explicit Integration in Interconnect Analysis on GPUs},
  booktitle = {Proceedings of the 2020 IEEE International Symposium on Circuits and Systems},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/iscas45731.2020.9181157}
}
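The explicit integrator behind the Domenech-Asensi and Kazmierski solver is the second-order Adams-Bashforth formula, y_{n+1} = y_n + h(3/2 f_n - 1/2 f_{n-1}). The MATLAB sketch below applies it to a small tridiagonal system that loosely stands in for an RC ladder (an assumption of mine, not the paper's interconnect model); the step size is chosen inside the explicit stability limit, which is exactly the issue the paper analyses.
n = 50;
A = -2*eye(n) + diag(ones(n-1,1), 1) + diag(ones(n-1,1), -1);   % stable tridiagonal test system
y = ones(n, 1);
h = 0.05;                              % within the explicit stability region for this A
f_prev = A * y;
y = y + h * f_prev;                    % bootstrap the multistep method with one forward Euler step
for k = 1:200
    f = A * y;
    y = y + h * (1.5*f - 0.5*f_prev);  % AB2 step
    f_prev = f;
end
fprintf('solution norm after 200 AB2 steps: %.3e\n', norm(y));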
Domingos J and Moura JMF (2020), "Graph Fourier Transform: A Stable Approximation", January, 2020.
Abstract: In Graph Signal Processing (GSP), data dependencies are represented by a graph whose nodes label the data and the edges capture dependencies among nodes. The graph is represented by a weighted adjacency matrix A that, in GSP, generalizes the Discrete Signal Processing (DSP) shift operator z^-1. The (right) eigenvectors of the shift A (graph spectral components) diagonalize A and lead to a graph Fourier basis F that provides a graph spectral representation of the graph signal. The inverse of the (matrix of the) graph Fourier basis F is the Graph Fourier transform (GFT), F^-1. Often, including in real world examples, this diagonalization is numerically unstable. This paper develops an approach to compute an accurate approximation to F and F^-1, while ensuring their numerical stability, by means of solving a non convex optimization problem. To address the non-convexity, we propose an algorithm, the stable graph Fourier basis algorithm (SGFA), that we prove exponentially increases the accuracy of the approximating F per iteration. Likewise, we can apply SGFA to A^H and, hence, approximate the stable left eigenvectors for the graph shift A and directly compute the GFT. We evaluate empirically the quality of SGFA by applying it to graph shifts A drawn from two real world problems, the 2004 US political blogs graph and the Manhattan road map, carrying out a comprehensive study on tradeoffs between different SGFA parameters. We also confirm our conclusions by applying SGFA on very sparse and very dense directed Erdős-Rényi graphs.
BibTeX:
@article{Domingos2020,
  author = {João Domingos and José M. F. Moura},
  title = {Graph Fourier Transform: A Stable Approximation},
  year = {2020}
}
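The construction Domingos and Moura stabilize is easy to state: the eigenvectors of the weighted adjacency matrix form the graph Fourier basis F, and applying F^-1 to a graph signal gives its GFT. The MATLAB sketch below computes this naive, potentially ill-conditioned baseline on a random directed graph; the paper's SGFA replaces it with an optimization that enforces numerical stability, which this sketch does not attempt.
rng('default');
n = 30;
A = full(sprand(n, n, 0.2));      % weighted adjacency matrix of a random digraph
[F, ~] = eig(A);                  % columns of F are the graph spectral components
x = randn(n, 1);                  % a graph signal
xhat = F \ x;                     % GFT of x, i.e. F^-1 * x
fprintf('cond(F) = %.2e, reconstruction error = %.2e\n', cond(F), norm(F*xhat - x));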
Dong J and Tong XT (2020), "Replica Exchange for Non-Convex Optimization", January, 2020.
Abstract: Gradient descent (GD) is known to converge quickly for convex objective functions, but it can be trapped at local minimums. On the other hand, Langevin dynamics (LD) can explore the state space and find global minimums, but in order to give accurate estimates, LD needs to run with small discretization stepsize and weak stochastic force, which in general slow down its convergence. This paper shows that these two algorithms can "collaborate" through a simple exchange mechanism, in which they swap their current positions if LD yields a lower objective function. This idea can be seen as the singular limit of the replica exchange technique from the sampling literature. We show that this new algorithm converges to the global minimum linearly with high probability, assuming the objective function is strongly convex in a neighborhood of the unique global minimum. By replacing gradients with stochastic gradients, and adding a proper threshold to the exchange mechanism, our algorithm can also be used in online settings. We further verify our theoretical results through some numerical experiments, and observe superior performance of the proposed algorithm over running GD or LD alone.
BibTeX:
@article{Dong2020,
  author = {Jing Dong and Xin T. Tong},
  title = {Replica Exchange for Non-Convex Optimization},
  year = {2020}
}
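The exchange mechanism in Dong and Tong's paper is simple enough to sketch in a few lines: run gradient descent and Langevin dynamics side by side and swap their iterates whenever the Langevin replica finds a lower objective value. The 1D MATLAB example below uses a made-up double-well function and parameters, purely as an illustration of the idea.
rng('default');
f  = @(x) x.^4 - 3*x.^2 + 0.5*x;            % double well; the global minimum is the left one
df = @(x) 4*x.^3 - 6*x + 0.5;
eta = 0.01; temp = 0.5;
x_gd = 2; x_ld = 2;                          % both replicas start near the poorer minimum
for k = 1:5000
    x_gd = x_gd - eta*df(x_gd);                            % GD replica: exploits
    x_ld = x_ld - eta*df(x_ld) + sqrt(2*eta*temp)*randn;   % LD replica: explores
    if f(x_ld) < f(x_gd)                     % exchange: GD adopts the better point
        t = x_gd; x_gd = x_ld; x_ld = t;
    end
end
fprintf('GD replica ends at x = %.3f with f(x) = %.3f\n', x_gd, f(x_gd));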
Dongarra J, Gates M, Luszczek P and Tomov S (2020), "Translational Process: Mathematical Software Perspective". Thesis at: University of Tennessee.
Abstract: Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
BibTeX:
@techreport{Dongarra2020,
  author = {Jack Dongarra and Mark Gates and Piotr Luszczek and Stanimire Tomov},
  title = {Translational Process: Mathematical Software Perspective},
  school = {University of Tennessee},
  year = {2020},
  url = {https://www.icl.utk.edu/files/publications/2020/icl-utk-1404-2020.pdf}
}
Duff I, Hogg J and Lopez F (2020), "A New Sparse LDL^T Solver Using A Posteriori Threshold Pivoting", SIAM Journal on Scientific Computing., 1, 2020. Vol. 42(2), pp. C23-C42. Society for Industrial & Applied Mathematics (SIAM).
Abstract: The factorization of sparse symmetric indefinite systems is particularly challenging since pivoting is required to maintain stability of the factorization. Pivoting techniques generally offer limited parallelism and are associated with significant data movement hindering the scalability of these methods. Variants of the threshold partial pivoting (TPP) algorithm, for example, have often been used because of its numerical robustness but standard implementations exhibit poor parallel performance. On the other hand, some methods trade stability for performance on parallel architectures such as the supernode Bunch-Kaufman used in the PARDISO solver. In this case, however, the factors obtained might not be used to accurately compute the solution of the system. For this reason we have designed a task-based LDL^T factorization algorithm based on a new pivoting strategy called a posteriori threshold pivoting (APTP) that is much more suitable for modern multicore architectures and has the same numerical robustness as the TPP strategy. We implemented our algorithm in a new version of the SPRAL sparse symmetric indefinite direct solver, which initially supported GPU-only factorization. We have used OpenMP 4 task features to implement a multifrontal algorithm with dense factorizations using the novel APTP, and we show that it performs favorably compared to the state-of-the-art solvers HSL_MA86, HSL_MA97 and PARDISO both in terms of performance on a multicore machine and in terms of numerical robustness. Finally we show that this new solver is able to make use of GPU devices for accelerating the factorization on heterogeneous architectures.
BibTeX:
@article{Duff2020,
  author = {Iain Duff and Jonathan Hogg and Florent Lopez},
  title = {A New Sparse LDL^T Solver Using A Posteriori Threshold Pivoting},
  journal = {SIAM Journal on Scientific Computing},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {42},
  number = {2},
  pages = {C23--C42},
  doi = {10.1137/18m1225963}
}
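As a small reference point for the Duff, Hogg and Lopez solver: MATLAB's built-in ldl computes the same kind of pivoted LDL^T factorization of a symmetric indefinite matrix, with 1-by-1 and 2-by-2 pivots in D, though without the task-parallel a posteriori pivoting that is the paper's contribution. A quick check of the factorization identity on a random test matrix:
rng('default');
n = 200;
A = full(sprandsym(n, 0.05));              % symmetric indefinite test matrix
[L, D, P] = ldl(A);                        % P'*A*P = L*D*L', with D block-diagonal
fprintf('factorization residual: %.2e\n', norm(P'*A*P - L*D*L', 'fro'));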
Duff I, Leleux P, Ruiz D and Torun FS (2020), "Improving the Scalability of the ABCD Solver with a Combination of New Load Balancing and Communication Minimization Techniques", Advances in Parallel Computing. Vol. 36(Parallel Computing: Technology Trends), pp. 277-286. IOS Press.
Abstract: The hybrid scheme block row-projection method implemented in the ABCD Solver is designed for solving large sparse unsymmetric systems of equations on distributed memory parallel computers. The method implements a block Cimmino iterative scheme, accelerated with a stabilized block conjugate gradient algorithm. An augmented pseudo-direct variant has also been developed to overcome convergence issues. Both methods are included in the ABCD solver with a hybrid parallelization scheme. The parallel performance of the ABCD Solver is improved in the first non-beta release, version 1.0, which we present in this paper. Novel algorithms for the distribution of partitions to processes are introduced to minimize communication as well as to balance the workload. Furthermore, the master-slave approach on each subsystem is also improved in order to achieve higher scalability through run-time placement of processes. We illustrate the improved parallel scalability of the ABCD Solver on a distributed memory architecture by solving several problems from the SuiteSparse Matrix Collection.
BibTeX:
@article{Duff2020a,
  author = {Duff, Iain and Leleux, Philippe and Ruiz, Daniel and Torun, F. Sukru},
  title = {Improving the Scalability of the ABCD Solver with a Combination of New Load Balancing and Communication Minimization Techniques},
  journal = {Advances in Parallel Computing},
  publisher = {IOS Press},
  year = {2020},
  volume = {36},
  number = {Parallel Computing: Technology Trends},
  pages = {277--286},
  doi = {10.3233/APC200052}
}
Dvurechensky P, Shtern S, Staudigl M, Ostroukhov P and Safin K (2020), "Self-concordant analysis of Frank-Wolfe algorithms", February, 2020.
Abstract: Projection-free optimization via different variants of the Frank-Wolfe (FW) method has become one of the cornerstones in optimization for machine learning since in many cases the linear minimization oracle is much cheaper to implement than projections and some sparsity needs to be preserved. In a number of applications, e.g. Poisson inverse problems or quantum state tomography, the loss is given by a self-concordant (SC) function having unbounded curvature, implying absence of theoretical guarantees for the existing FW methods. We use the theory of SC functions to provide a new adaptive step size for FW methods and prove global convergence rate O(1/k), k being the iteration counter. If the problem can be represented by a local linear minimization oracle, we are the first to propose a FW method with linear convergence rate without assuming either strong convexity or a Lipschitz continuous gradient.
BibTeX:
@article{Dvurechensky2020,
  author = {Pavel Dvurechensky and Shimrit Shtern and Mathias Staudigl and Petr Ostroukhov and Kamil Safin},
  title = {Self-concordant analysis of Frank-Wolfe algorithms},
  year = {2020}
}
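For orientation, the classical Frank-Wolfe loop that Dvurechensky et al. build on is shown below for a quadratic objective over the probability simplex, where the linear minimization oracle reduces to picking a single coordinate. The sketch uses the standard 2/(k+2) step size; the paper's contribution is an adaptive step size tailored to self-concordant losses, which is not reproduced here.
rng('default');
d = 50;
Q = randn(d); Q = Q'*Q + eye(d);           % positive definite quadratic
b = randn(d, 1);
f = @(x) 0.5*x'*Q*x - b'*x;
x = ones(d, 1) / d;                        % start at the barycenter of the simplex
for k = 0:200
    g = Q*x - b;
    [~, i] = min(g);                       % LMO over the simplex returns a vertex e_i
    s = zeros(d, 1); s(i) = 1;
    x = x + 2/(k + 2) * (s - x);           % classical step size; iterate stays in the simplex
end
fprintf('objective after 200 FW steps: %.4f\n', f(x));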
Edwards JA (2020), "Study of fine-grained, irregular parallel applications on a many-core processor". Thesis at: University of Maryland, Institute for Advanced Computer Studies and Department of Electrical and Computer Engineering.
Abstract: This dissertation demonstrates the possibility of obtaining strong speedups for a variety of parallel applications versus the best serial and parallel implementations on commodity platforms. These results were obtained using the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two.
BibTeX:
@phdthesis{Edwars2020,
  author = {James Alexander Edwards},
  title = {Study of fine-grained, irregular parallel applications on a many-core processor},
  school = {University of Maryland, Institute for Advanced Computer Studies and Department of Electrical and Computer Engineering},
  year = {2020},
  url = {https://drum.lib.umd.edu/bitstream/handle/1903/26626/Edwards_umd_0117E_21139.pdf}
}
Ek D and Forsgren A (2020), "Approximate solution of system of equations arising in interior-point methods for bound-constrained optimization", April, 2020.
Abstract: The focus in this paper is interior-point methods for bound-constrained nonlinear optimization where the systems of nonlinear equations that arise are solved with Newton's method. There is a trade-off between solving Newton systems directly, which gives high-quality solutions, and solving many approximate Newton systems which are computationally less expensive but give lower quality solutions. We propose partial and full approximate solutions to the Newton systems, which in general involves solving a reduced system of linear equations. The specific approximate solution and the size of the reduced system that needs to be solved at each iteration are determined by estimates of the active and inactive constraints at the solution. These sets are at each iteration estimated by a simple heuristic. In addition, we motivate and suggest two modified-Newton approaches which are based on an intermediate step that consists of the partial approximate solutions. The theoretical setting is introduced and asymptotic error bounds are given along with numerical results for bound-constrained convex quadratic optimization problems, both random and from the CUTEst test collection.
BibTeX:
@article{Ek2020,
  author = {David Ek and Anders Forsgren},
  title = {Approximate solution of system of equations arising in interior-point methods for bound-constrained optimization},
  year = {2020}
}
Eldén L and Dehghan M (2020), "A Krylov-Schur like method for computing the best rank-(r_1,r_2,r_3) approximation of large and sparse tensors", December, 2020.
Abstract: The paper is concerned with methods for computing the best low multilinear rank approximation of large and sparse tensors. Krylov-type methods have been used for this problem; here block versions are introduced. For the computation of partial eigenvalue and singular value decompositions of matrices the Krylov-Schur (restarted Arnoldi) method is used. We describe a generalization of this method to tensors, for computing the best low multilinear rank approximation of large and sparse tensors. In analogy to the matrix case, the large tensor is only accessed in multiplications between the tensor and blocks of vectors, thus avoiding excessive memory usage. It is proved that, if the starting approximation is good enough, then the tensor Krylov-Schur method is convergent. Numerical examples are given for synthetic tensors and sparse tensors from applications, which demonstrate that for most large problems the Krylov-Schur method converges faster and more robustly than higher order orthogonal iteration.
BibTeX:
@article{Elden2020,
  author = {L. Eldén and M. Dehghan},
  title = {A Krylov-Schur like method for computing the best rank-(r_1,r_2,r_3) approximation of large and sparse tensors},
  year = {2020}
}
Elekes M, Nagy A, Sándor D, Antal JB, Davis TA and Szárnyas G (2020), "A GraphBLAS solution to the SIGMOD 2014 Programming Contest using multi-source BFS", In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference.
Abstract: The GraphBLAS standard defines a set of fundamental building blocks for formulating graph algorithms in the language of linear algebra. Since its first release in 2017, the expressivity of the GraphBLAS API and the performance of its implementations (such as SuiteSparse:GraphBLAS) have been studied on a number of textbook graph algorithms such as BFS, single-source shortest path, and connected components. However, less attention was devoted to other aspects of graph processing such as handling typed and attributed graphs (also known as property graphs), and making use of complex graph query techniques (handling paths, aggregation, and filtering). To study these problems in more detail, we have used GraphBLAS to solve the case study of the 2014 SIGMOD Programming Contest, which defines complex graph processing tasks that require a diverse set of operations. Our solution makes heavy use of multi-source BFS algorithms expressed as sparse matrix-matrix multiplications along with other GraphBLAS techniques such as masking and submatrix extraction. While the queries can be formulated in GraphBLAS concisely, our performance evaluation shows mixed results. For some queries and data sets, the performance is competitive with the hand-optimized top solutions submitted to the contest, however, in some cases, it is currently outperformed by orders of magnitude.
BibTeX:
@inproceedings{Elekes2020,
  author = {Márton Elekes and Attila Nagy and Dávid Sándor and János Benjamin Antal and Timothy A. Davis and Gábor Szárnyas},
  title = {A GraphBLAS solution to the SIGMOD 2014 Programming Contest using multi-source BFS},
  booktitle = {Proceedings of the 2020 IEEE High Performance Extreme Computing Conference},
  year = {2020}
}
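The central primitive, multi-source BFS as repeated sparse matrix-matrix products, is easy to mimic in plain MATLAB. The function below is my own stand-in for the masked GraphBLAS kernels the paper uses (the masking here is done crudely by subtracting the visited set), not the authors' code.
function levels = multi_source_bfs(A, sources)
% A: n-by-n sparse adjacency matrix, A(i,j) ~= 0 meaning an edge i -> j;
% sources: vector of k start vertices.
% levels(v, s) is the BFS level of vertex v from source s (Inf if unreachable).
n = size(A, 1);
k = numel(sources);
frontier = sparse(sources(:), (1:k)', 1, n, k);   % one column per source
visited  = frontier;
levels   = inf(n, k);
levels(sub2ind([n k], sources(:), (1:k)')) = 0;
lvl = 0;
while nnz(frontier) > 0
    lvl = lvl + 1;
    reach = spones(A.' * frontier);               % vertices one hop from each frontier
    next  = spones(max(reach - visited, 0));      % drop already-visited vertices
    levels(next ~= 0) = lvl;
    visited  = spones(visited + next);
    frontier = next;
end
end
Each iteration advances all k searches with a single sparse matrix-matrix product, which is exactly what makes the multi-source formulation attractive.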
Elimelech K and Indelman V (2020), "Efficient Modification of the Upper Triangular Square Root Matrix on Variable Reordering", IEEE Robotics and Automation Letters.
Abstract: In probabilistic state inference, we seek to estimate the state of an (autonomous) agent from noisy observations. It can be shown that, under certain assumptions, finding the estimate is equivalent to solving a linear least squares problem. Solving such a problem is done by calculating the upper triangular matrix R from the coefficient matrix A, using the QR or Cholesky factorizations; this matrix is commonly referred to as the "square root matrix". In sequential estimation problems, we are often interested in periodic optimization of the state variable order, e.g., to reduce fill-in, or to apply a predictive variable ordering tactic; however, changing the variable order implies expensive re-factorization of the system. Thus, we address the problem of modifying an existing square root matrix R, to convey reordering of the variables. To this end, we identify several conclusions regarding the effect of column permutation on the factorization, to allow efficient modification of R, without accessing A at all, or with minimal re-factorization. The proposed parallelizable algorithm achieves a significant improvement in performance over the state-of-the-art incremental smoothing and mapping approach, which considers incremental factorization on updates.
BibTeX:
@article{Elimelech2020,
  author = {Khen Elimelech and Vadim Indelman},
  title = {Efficient Modification of the Upper Triangular Square Root Matrix on Variable Reordering},
  journal = {IEEE Robotics and Automation Letters},
  year = {2020}
}
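The identity underlying this paper is easy to check in MATLAB: since A(:,p)'*A(:,p) = R(:,p)'*R(:,p), re-triangularizing the permuted R already yields the square-root matrix of the reordered problem without touching A. The dense sketch below (my own, with arbitrary sizes) only verifies that identity; the paper's contribution is performing this modification efficiently and sparsely.
rng(4);
A = randn(50, 10);
[~, R] = qr(A, 0);             % original square-root matrix
p = randperm(10);              % new variable ordering
[~, Rp] = qr(R(:, p), 0);      % re-triangularize the permuted factor (A not needed)
[~, Rref] = qr(A(:, p), 0);    % reference: full re-factorization
fprintf('difference (up to row signs): %.2e\n', norm(abs(Rp) - abs(Rref), 'fro'));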
Ellis MM (2020), "Parallelizing Irregular Applications for Distributed Memory Scalability: Case Studies from Genomics". Thesis at: University of California, Berkeley.
Abstract: Generalizable approaches, models, and frameworks for irregular application scalability is an old yet open area in parallel and distributed computing research. Irregular applications are particularly hard to parallelize and distribute because, by definition, the pattern of computation is dependent upon the input data. With the proliferation of data-driven and data-intensive applications from the realm of Big Data, and the increasing demand for and availability of large-scale computing resources through HPC-Cloud convergence, the importance of generalized approaches to achieving irregular application scalability is only growing.\ Rather than offering another software language or framework, this dissertation argues we first need to understand application scalability, especially irregular application scalability, and more closely examine patterns of computation, data sharing, and dependencies. As it stands, predominant performance models and tools from parallel and distributed computing focus on applications that are divided into distinct communication and computation phases, and ignore issues related to memory utilization. While time-tested and valuable, these models are not always sufficient for understanding full application scalability, particularly, the scalability of data-intensive irregular applications. We present application case studies from genomics, highlighting the interdependencies of communication, computation, and memory capacities and performance.\ The genomics applications we will examine offer a particularly useful and practical vantage point for this analysis, as they are data-intensive irregular application targets for both HPC and cloud computing. Further, they present an extreme for both domains. For HPC, they are less akin to traditional, well-studied and well-supported scientific simulations and more akin to text and document analysis applications. For cloud computing, they are an extreme in that they require frequent random global access to memory and data, stressing interconnection network latency and bandwidth and co-scheduled processors for tightly orchestrated computation.\ We show how common patterns of irregular all-to-all computation can be managed efficiently, comparing bulk-synchronous approaches built on collective communication and asynchronous approaches based on one-sided communication. For the former, our work is based on the popular Message Passing Interface (MPI) and makes heavy use of globally collective communication operations that exchange data across processors in a single step or, to save memory use, in a set of irregular steps. For the latter, we build on the UPC++ programming framework, which provides lightweight RPC mechanisms, to transfer both data and computational work between processors. We present performance results across multiple platforms including several modern HPC systems and, at least in one case, a cloud computing platform. With these application case studies, we seek not only to contribute to discussions around parallel algorithm and data structure design, programming systems, and performance modeling within the parallel computing community, but also to contribute to broader work in genomics through software development and analysis. Thus, we develop and present the first distributed memory scalable software for analyzing data sets from the latest generation of sequencing technologies, known as long read data sets.
Specifically, we present scalable solutions to the problem of many-to-many long read overlap and alignment, the computational bottleneck to long read assembly, error correction, and direct analysis. Through cross-architectural empirical analysis, we identify the key components to efficient scalability, and highlight the priorities for any future optimization with analytical models.
BibTeX:
@phdthesis{Ellis2020,
  author = {Marquita May Ellis},
  title = {Parallelizing Irregular Applications for Distributed Memory Scalability: Case Studies from Genomics},
  school = {University of California, Berkeley},
  year = {2020},
  url = {https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-133.pdf}
}
Engelmann A, Jiang Y, Benner H, Ou R, Houska B and Faulwasser T (2020), "ALADIN-α -- An open-source MATLAB toolbox for distributed non-convex optimization", June, 2020.
Abstract: This paper introduces an open-source software for distributed and decentralized non-convex optimization named ALADIN-α. ALADIN-α is a MATLAB implementation of the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm, which is tailored towards rapid prototyping for non-convex distributed optimization. An improved version of the recently proposed bi-level variant of ALADIN is included enabling decentralized non-convex optimization. A collection of application examples from different application fields including chemical engineering, robotics, and power systems underpins the application potential of ALADIN-α.
BibTeX:
@article{Engelmann2020,
  author = {Alexander Engelmann and Yuning Jiang and Henrieke Benner and Ruchuan Ou and Boris Houska and Timm Faulwasser},
  title = {ALADIN-α -- An open-source MATLAB toolbox for distributed non-convex optimization},
  year = {2020}
}
Eswar S, Hayashi K, Ballard G, Kannan R, Vuduc R and Park H (2020), "Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 1041-1054. IEEE Computer Society.
Abstract: We develop the first distributed-memory parallel implementation of Symmetric Nonnegative Matrix Factorization (SymNMF), a key data analytics kernel for clustering and dimensionality reduction. Our implementation includes two different algorithms for SymNMF, which give comparable results in terms of time and accuracy. The first algorithm is a parallelization of an existing sequential approach that uses solvers for nonsymmetric NMF. The second algorithm is a novel approach based on the Gauss-Newton method. It exploits second-order information without incurring large computational and memory costs. We evaluate the scalability of our algorithms on the Summit system at Oak Ridge National Laboratory, scaling up to 128 nodes (4096 cores) with 70% efficiency. Additionally, we demonstrate our software on an image segmentation task.
BibTeX:
@inproceedings{Eswar2020,
  author = {S. Eswar and K. Hayashi and G. Ballard and R. Kannan and R. Vuduc and H. Park},
  title = {Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {1041--1054},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00078},
  doi = {10.1109/SC41405.2020.00078}
}
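As a reminder of what SymNMF computes, here is a naive single-node projected-gradient sketch for min ||A - HH'||_F^2 with H >= 0. It is my own toy code with made-up sizes; the paper's distributed ANLS and Gauss-Newton algorithms are far more sophisticated.
rng(1);
n = 200; k = 8;
W = rand(n, k);
A = W * W';                                  % symmetric nonnegative test matrix
H = rand(n, k);
for it = 1:300
    G = 4 * (H * (H' * H) - A * H);          % gradient of ||A - H*H'||_F^2
    fH = norm(A - H*H', 'fro')^2;
    step = 1;
    while true                               % crude backtracking line search
        Hn = max(H - step * G, 0);           % projected gradient step
        if norm(A - Hn*Hn', 'fro')^2 <= fH || step < 1e-12
            break
        end
        step = step / 2;
    end
    H = Hn;
end
fprintf('relative fit: %.3e\n', norm(A - H*H', 'fro') / norm(A, 'fro'));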
Facca E and Benzi M (2020), "Fast Iterative Solution of the Optimal Transport Problem on Graphs", September, 2020.
Abstract: In this paper, we address the numerical solution of the Optimal Transport Problem on undirected weighted graphs, taking the shortest path distance as transport cost. The optimal solution is obtained from the long-time limit of the gradient descent dynamics. Among different time stepping procedures for the discretization of this dynamics, a backward Euler time stepping scheme combined with the inexact Newton-Raphson method results in a robust and accurate approach for the solution of the Optimal Transport Problem on graphs. It is found experimentally that the algorithm requires solving between 𝒪(1) and 𝒪(M^0.36) linear systems involving weighted Laplacian matrices, where M is the number of edges. These linear systems are solved via algebraic multigrid methods, resulting in an efficient solver for the Optimal Transport Problem on graphs.
BibTeX:
@article{Facca2020,
  author = {Enrico Facca and Michele Benzi},
  title = {Fast Iterative Solution of the Optimal Transport Problem on Graphs},
  year = {2020}
}
Fan T, Shuman DI, Ubaru S and Saad Y (2020), "Spectrum-Adapted Polynomial Approximation for Matrix Functions with Applications in Graph Signal Processing", Algorithms., 11, 2020. Vol. 13(11), pp. 295. MDPI AG.
Abstract: We propose and investigate two new methods to approximate f(A)b for large, sparse, Hermitian matrices A. Computations of this form play an important role in numerous signal processing and machine learning tasks. The main idea behind both methods is to first estimate the spectral density of A, and then find polynomials of a fixed order that better approximate the function f on areas of the spectrum with a higher density of eigenvalues. Compared to state-of-the-art methods such as the Lanczos method and truncated Chebyshev expansion, the proposed methods tend to provide more accurate approximations of f(A)b at lower polynomial orders, and for matrices A with a large number of distinct interior eigenvalues and a small spectral width. We also explore the application of these techniques to (i) fast estimation of the norms of localized graph spectral filter dictionary atoms, and (ii) fast filtering of time-vertex signals.
BibTeX:
@article{Fan2020,
  author = {Tiffany Fan and David I. Shuman and Shashanka Ubaru and Yousef Saad},
  title = {Spectrum-Adapted Polynomial Approximation for Matrix Functions with Applications in Graph Signal Processing},
  journal = {Algorithms},
  publisher = {MDPI AG},
  year = {2020},
  volume = {13},
  number = {11},
  pages = {295},
  doi = {10.3390/a13110295}
}
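The Lanczos baseline mentioned in the abstract is compact enough to sketch in MATLAB. This is my own code, shown with f = exp(-x) as an example; it is the baseline the authors compare against, not the spectrum-adapted method proposed in the paper.
function fAb = lanczos_fAb(A, b, m, f)
% Approximate f(A)*b for symmetric A via m steps of the Lanczos process.
n = numel(b);
V = zeros(n, m+1);
alpha = zeros(m, 1); beta = zeros(m, 1);
V(:,1) = b / norm(b);
for j = 1:m
    w = A * V(:,j);
    if j > 1
        w = w - beta(j-1) * V(:,j-1);
    end
    alpha(j) = V(:,j)' * w;
    w = w - alpha(j) * V(:,j);
    beta(j) = norm(w);
    if beta(j) < 1e-14, m = j; break; end    % invariant subspace found
    V(:,j+1) = w / beta(j);
end
T = diag(alpha(1:m)) + diag(beta(1:m-1), 1) + diag(beta(1:m-1), -1);
[Q, D] = eig(T);
e1 = zeros(m, 1); e1(1) = 1;
fAb = norm(b) * (V(:,1:m) * (Q * (f(diag(D)) .* (Q' * e1))));   % ||b|| * V_m * f(T_m) * e1
end
For instance, lanczos_fAb(gallery('poisson', 30), randn(900, 1), 40, @(x) exp(-x)) approximates expm(-A)*b without ever forming the matrix function explicitly.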
Fang J, Huang C, Tang T and Wang Z (2020), "Parallel Programming Models for Heterogeneous Many-Cores: A Survey", May, 2020.
Abstract: Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedding systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high-performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey for parallel programming models for heterogeneous many-core architectures and review the compiling techniques of improving programmability and portability. We examine various software optimization techniques for minimizing the communicating overhead between heterogeneous computing devices. We provide a road map for a wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
BibTeX:
@article{Fang2020,
  author = {Jianbin Fang and Chun Huang and Tao Tang and Zheng Wang},
  title = {Parallel Programming Models for Heterogeneous Many-Cores: A Survey},
  year = {2020}
}
Fang Z, Zhu S, Zhang J, Liu Y, Chen Z and He Y (2020), "Low Rank Directed Acyclic Graphs and Causal Structure Learning", June, 2020.
Abstract: Despite several important advances in recent years, learning causal structures represented by directed acyclic graphs (DAGs) remains a challenging task in high dimensional settings when the graphs to be learned are not sparse. In particular, the recent formulation of structure learning as a continuous optimization problem proved to have considerable advantages over the traditional combinatorial formulation, but the performance of the resulting algorithms is still wanting when the target graph is relatively large and dense. In this paper we propose a novel approach to mitigate this problem, by exploiting a low rank assumption regarding the (weighted) adjacency matrix of a DAG causal model. We establish several useful results relating interpretable graphical conditions to the low rank assumption, and show how to adapt existing methods for causal structure learning to take advantage of this assumption. We also provide empirical evidence for the utility of our low rank algorithms, especially on graphs that are not sparse. Not only do they outperform state-of-the-art algorithms when the low rank condition is satisfied, the performance on randomly generated scale-free graphs is also very competitive even though the true ranks may not be as low as is assumed.
BibTeX:
@article{Fang2020a,
  author = {Zhuangyan Fang and Shengyu Zhu and Jiji Zhang and Yue Liu and Zhitang Chen and Yangbo He},
  title = {Low Rank Directed Acyclic Graphs and Causal Structure Learning},
  year = {2020}
}
Farhan MA, Abdelfattah A, Tomov S, Gates M, Sukkari D, Haidar A, Rosenberg R and Dongarra J (2020), "MAGMA templates for scalable linear algebra on emerging architectures", The International Journal of High Performance Computing Applications., 7, 2020. , pp. 109434202093842. SAGE Publications.
Abstract: With the acquisition and widespread use of more resources that rely on accelerator/wide vector-based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
BibTeX:
@article{Farhan2020,
  author = {Mohammed Al Farhan and Ahmad Abdelfattah and Stanimire Tomov and Mark Gates and Dalal Sukkari and Azzam Haidar and Robert Rosenberg and Jack Dongarra},
  title = {MAGMA templates for scalable linear algebra on emerging architectures},
  journal = {The International Journal of High Performance Computing Applications},
  publisher = {SAGE Publications},
  year = {2020},
  pages = {109434202093842},
  doi = {10.1177/1094342020938421}
}
Fasi M and Higham N (2020), "Generating extreme-scale matrices with specified singular values or condition numbers"
Abstract: A widely used form of test matrix is the randsvd matrix constructed as the product A = U Σ V^*, where U and V are random orthogonal or unitary matrices from the Haar distribution and Σ is a diagonal matrix of singular values. Such matrices are random but have a specified singular value distribution. The cost of forming an m × n randsvd matrix is m^3 + n^3 flops, which is prohibitively expensive at extreme scale; moreover, the randsvd construction requires a significant amount of communication, making it unsuitable for distributed memory environments. By dropping the requirement that U and V be Haar distributed and that both be random, we derive new algorithms for forming A that have cost linear in the number of matrix elements and require a low amount of communication and synchronization. We specialize these algorithms to generating matrices with specified 2-norm condition number. Numerical experiments show that the algorithms have excellent efficiency and scalability.
BibTeX:
@article{Fasi2020,
  author = {Fasi, Massimiliano and Higham, Nicholas},
  title = {Generating extreme-scale matrices with specified singular values or condition numbers},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2755/1/fahi20.pdf}
}
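For context, the classical randsvd construction that the paper takes as its starting point looks like this in MATLAB (a dense toy sketch with geometrically spaced singular values and made-up sizes; the whole point of the paper is to avoid this O(m^3 + n^3) construction at extreme scale).
m = 500; n = 400; kappa = 1e8;
[U, RU] = qr(randn(m)); U = U * diag(sign(diag(RU)));   % Haar-distributed orthogonal factor
[V, RV] = qr(randn(n)); V = V * diag(sign(diag(RV)));
p = min(m, n);
sigma = kappa .^ (-(0:p-1)' / (p-1));                   % singular values from 1 down to 1/kappa
A = U(:, 1:p) * diag(sigma) * V(:, 1:p)';
fprintf('cond(A) = %.2e (target %.2e)\n', cond(A), kappa);
MATLAB's gallery('randsvd', ...) packages the same construction.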
Fasi M and Higham NJ (2020), "Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization"
Abstract: We propose a two-parameter family of nonsymmetric dense n × n matrices A(α, β) for which LU factorization without pivoting is numerically stable, and we show how to choose α and β to achieve any value of the ∞-norm condition number. The matrix A(α, β) can be formed from a simple formula in O(n^2) flops. The matrix is suitable for use in the HPL-AI Mixed-Precision Benchmark, which requires an extreme scale test matrix (dimension n > 10^7) that has a controlled condition number and can be safely used in LU factorization without pivoting. It is also of interest as a general-purpose test matrix.
BibTeX:
@article{Fasi2020b,
  author = {Fasi, Massimiliano and Higham, Nicholas J.},
  title = {Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/id/eprint/2775}
}
Favaro F, Dufrechou E, Ezzatti P and Oliver JP (2020), "Exploring FPGA Optimizations to Compute Sparse Numerical Linear Algebra Kernels", In Applied Reconfigurable Computing. Architectures, Tools, and Applications. , pp. 258-268. Springer International Publishing.
Abstract: The solution of sparse triangular linear systems (sptrsv) is the bottleneck of many numerical methods. Thus, it is crucial to count with efficient implementations of such kernel, at least for commonly used platforms. In this sense, Field–Programmable Gate Arrays (FPGAs) have evolved greatly in the last years, entering the HPC hardware ecosystem largely due to their superior energy–efficiency relative to more established accelerators. Up until recently, the design for FPGAs implied the use of low–level Hardware Description Languages (HDL) such as VHDL or Verilog. Nowadays, manufacturers are making a large effort to adopt High–Level Synthesis languages like C/C++ or OpenCL, but the gap between their performance and that of HDLs is not yet fully studied. This work focuses on the performance offered by FPGAs to compute the sptrsv using OpenCL. For this purpose, we implement different parallel variants of this kernel and experimentally evaluate several setups, varying among others the work–group size, the number of compute units, the unroll–factor and the vectorization–factor.
BibTeX:
@incollection{Favaro2020,
  author = {Federico Favaro and Ernesto Dufrechou and Pablo Ezzatti and Juan P. Oliver},
  title = {Exploring FPGA Optimizations to Compute Sparse Numerical Linear Algebra Kernels},
  booktitle = {Applied Reconfigurable Computing. Architectures, Tools, and Applications},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {258--268},
  doi = {10.1007/978-3-030-44534-8_20}
}
Fegaras L and Noor MH (2020), "Translation of Array-Based Loops to Distributed Data-Parallel Programs", March, 2020.
Abstract: Large volumes of data generated by scientific experiments and simulations come in the form of arrays, while programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. But, as datasets grow larger, new frameworks in distributed Big Data analytics have become essential tools to large-scale scientific computing. Scientists, who are typically comfortable with numerical analysis tools but are not familiar with the intricacies of Big Data analytics, must now learn to convert their loop-based programs to distributed data-parallel programs. We present a novel framework for translating programs expressed as array-based loops to distributed data parallel programs that is more general and efficient than related work. Although our translations are over sparse arrays, we extend our framework to handle packed arrays, such as tiled matrices, without sacrificing performance. We report on a prototype implementation on top of Spark and evaluate the performance of our system relative to hand-written programs.
BibTeX:
@article{Fegaras2020,
  author = {Leonidas Fegaras and Md Hasanuzzaman Noor},
  title = {Translation of Array-Based Loops to Distributed Data-Parallel Programs},
  year = {2020}
}
Feng Z (2020), "GRASS: GRAph Spectral Sparsification Leveraging Scalable Spectral Perturbation Analysis", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. , pp. 1-1.
Abstract: Spectral graph sparsification aims to find ultra-sparse subgraphs whose Laplacian matrix can well approximate the original Laplacian eigenvalues and eigenvectors. In recent years, spectral sparsification techniques have been extensively studied for accelerating various numerical and graph-related applications. Prior nearly-linear-time spectral sparsification methods first extract low-stretch spanning tree from the original graph to form the backbone of the sparsifier, and then recover small portions of spectrally-critical off-tree edges to the spanning tree to significantly improve the approximation quality. However, it is not clear how many off-tree edges should be recovered for achieving a desired spectral similarity level within the sparsifier. Motivated by recent graph signal processing techniques, this paper proposes a similarity-aware spectral graph sparsification framework that leverages efficient spectral off-tree edge embedding and filtering schemes to construct spectral sparsifiers with guaranteed spectral similarity (relative condition number) level. An iterative graph densification scheme is also introduced to facilitate efficient and effective filtering of off-tree edges for highly ill-conditioned problems. The proposed method has been validated using various kinds of graphs obtained from public domain sparse matrix collections relevant to VLSI CAD, finite element analysis, as well as social and data networks frequently studied in many machine learning and data mining applications. For instance, a sparse SDD matrix with 40 million unknowns and 180 million nonzeros can be solved (1E-3 accuracy level) within two minutes using a single CPU core and about 6GB memory.
BibTeX:
@article{Feng2020,
  author = {Z. Feng},
  title = {GRASS: GRAph Spectral Sparsification Leveraging Scalable Spectral Perturbation Analysis},
  journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
  year = {2020},
  pages = {1-1},
  doi = {10.1109/TCAD.2020.2968543}
}
Frandsen A and Ge R (2020), "Optimization landscape of Tucker decomposition", Mathematical Programming., 6, 2020. Springer Science and Business Media LLC.
Abstract: Tucker decomposition is a popular technique for many data analysis and machine learning applications. Finding a Tucker decomposition is a nonconvex optimization problem. As the scale of the problems increases, local search algorithms such as stochastic gradient descent have become popular in practice. In this paper, we characterize the optimization landscape of the Tucker decomposition problem. In particular, we show that if the tensor has an exact Tucker decomposition, for a standard nonconvex objective of Tucker decomposition, all local minima are also globally optimal. We also give a local search algorithm that can find an approximate local (and global) optimal solution in polynomial time.
BibTeX:
@article{Frandsen2020,
  author = {Abraham Frandsen and Rong Ge},
  title = {Optimization landscape of Tucker decomposition},
  journal = {Mathematical Programming},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s10107-020-01531-z}
}
Gaihre A, Li XS and Liu H (2020), "GSoFa: Scalable Sparse LU Symbolic Factorization on GPUs", July, 2020.
Abstract: Decomposing a matrix A into a lower matrix L and an upper matrix U, which is also known as LU decomposition, is an important operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the L and U factors than the original matrix. Symbolic factorization step is needed to identify the nonzero structures of L and U matrices. Attracted by the enormous potentials of Graphics Processing Units (GPUs), an array of efforts has surged to deploy various steps of LU factorization on GPUs except, to the best of our knowledge, symbolic factorization. This paper introduces GSoFa, a GPU based Symbolic factorization design with the following three optimizations to enable scalable LU symbolic factorization for nonsymmetric pattern sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited for the Single Instruction Multiple Thread (SIMT) architecture of GPUs. Second, we propose multi-source concurrent symbolic factorization to improve the utilization of GPUs with focus on balancing the workload. Third, we introduce a three-pronged optimization to reduce the excessive space requirement faced by multi-source concurrent symbolic factorization. Taken together, this work scales LU symbolic factorization towards 1,000 GPUs with superior performance over the state-of-the-art CPU algorithm.
BibTeX:
@article{Gaihre2020,
  author = {Anil Gaihre and Xiaoye S. Li and Hang Liu},
  title = {GSoFa: Scalable Sparse LU Symbolic Factorization on GPUs},
  year = {2020}
}
Gao T, Lu S, Liu J and Chu C (2020), "Randomized Bregman Coordinate Descent Methods for Non-Lipschitz Optimization", January, 2020.
Abstract: We propose a new randomized Bregman (block) coordinate descent (RBCD) method for minimizing a composite problem, where the objective function could be either convex or nonconvex, and the smooth part is freed from the global Lipschitz-continuous (partial) gradient assumption. Under the notion of relative smoothness based on the Bregman distance, we prove that every limit point of the generated sequence is a stationary point. Further, we show that the iteration complexity of the proposed method is O(n𝜖^-2) to achieve an 𝜖-stationary point, where n is the number of blocks of coordinates. If the objective is assumed to be convex, the iteration complexity is improved to O(n𝜖^-1). If, in addition, the objective is strongly convex (relative to the reference function), the global linear convergence rate is recovered. We also present the accelerated version of the RBCD method, which attains an O(n𝜖^(-1/γ)) iteration complexity for the convex case, where the scalar γ ∊ [1,2] is determined by the generalized translation variant of the Bregman distance. Convergence analysis without assuming the global Lipschitz-continuous (partial) gradient sets our results apart from the existing works in the composite problems.
BibTeX:
@article{Gao2020,
  author = {Tianxiang Gao and Songtao Lu and Jia Liu and Chris Chu},
  title = {Randomized Bregman Coordinate Descent Methods for Non-Lipschitz Optimization},
  year = {2020}
}
Gao J, Ji W, Tan Z and Zhao Y (2020), "A Systematic Survey of General Sparse Matrix-Matrix Multiplication", February, 2020.
Abstract: SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much attention from researchers in fields of multigrid methods and graph analysis. Many optimization techniques have been developed for certain application fields and computing architecture over the decades. The objective of this paper is to provide a structured and comprehensive overview of the research on SpGEMM. Existing optimization techniques have been grouped into different categories based on their target problems and architectures. Covered topics include SpGEMM applications, size prediction of result matrix, matrix partitioning and load balancing, result accumulating, and target architecture-oriented optimization. The rationales of different algorithms in each category are analyzed, and a wide range of SpGEMM algorithms are summarized. This survey sufficiently reveals the latest progress and research status of SpGEMM optimization from 1977 to 2019. More specifically, an experimentally comparative study of existing implementations on CPU and GPU is presented. Based on our findings, we highlight future research directions and how future studies can leverage our findings to encourage better design and implementation.
BibTeX:
@article{Gao2020a,
  author = {Jianhua Gao and Weixing Ji and Zhaonian Tan and Yueyan Zhao},
  title = {A Systematic Survey of General Sparse Matrix-Matrix Multiplication},
  year = {2020}
}
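Most of the surveyed implementations are organized around Gustavson's row-wise formulation with some form of accumulator. The deliberately slow MATLAB transcription below, with a dense accumulator per output row, is my own illustration of that scheme for orientation, not code from the survey.
function C = spgemm_gustavson(A, B)
% Row-wise (Gustavson) SpGEMM with a dense accumulator per row of C.
% Illustrative only: MATLAB's built-in A*B is vastly faster.
[m, kA] = size(A); [kB, n] = size(B);
assert(kA == kB, 'inner dimensions must agree');
At = A.'; Bt = B.';            % column access of the transposes = row access
acc = zeros(1, n);             % dense accumulator
rows = cell(m, 1); cols = cell(m, 1); vals = cell(m, 1);
for i = 1:m
    [ks, ~, av] = find(At(:, i));          % nonzeros of row i of A
    jall = zeros(0, 1);
    for t = 1:numel(ks)
        [js, ~, bv] = find(Bt(:, ks(t)));  % nonzeros of row ks(t) of B
        acc(js) = acc(js) + av(t) * bv.';  % scale-and-accumulate
        jall = [jall; js];                 %#ok<AGROW>
    end
    touched = unique(jall).';
    rows{i} = repmat(i, 1, numel(touched));
    cols{i} = touched;
    vals{i} = acc(touched);
    acc(touched) = 0;                      % reset only the touched entries
end
C = sparse([rows{:}], [cols{:}], [vals{:}], m, n);
end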
Gao W, Li Y and Lu B (2020), "Triangularized Orthogonalization-free Method for Solving Extreme Eigenvalue Problems", May, 2020.
Abstract: A novel orthogonalization-free method together with two specific algorithms are proposed to solve extreme eigenvalue problems. On top of gradient-based algorithms, the proposed algorithms modify the multi-column gradient such that earlier columns are decoupled from later ones. Global convergence to eigenvectors instead of eigenspace is guaranteed almost surely. Locally, algorithms converge linearly with convergence rate depending on eigengaps. Momentum acceleration, exact linesearch, and column locking are incorporated to further accelerate both algorithms and reduce their computational costs. We demonstrate the efficiency of both algorithms on several random matrices with different spectrum distribution and matrices from practice.
BibTeX:
@article{Gao2020b,
  author = {Weiguo Gao and Yingzhou Li and Bichen Lu},
  title = {Triangularized Orthogonalization-free Method for Solving Extreme Eigenvalue Problems},
  year = {2020}
}
Gao G, Wang Y, Vink J, Wells T and Saaf F (2020), "Distributed Quasi-Newton Derivative-Free Optimization Method for Optimization Problems with Multiple Local Optima", In Conference Proceedings, ECMOR. , pp. 1-22.
Abstract: For highly nonlinear problems, the objective function f(x) may have multiple local optima and it is desired to locate all of them. Analytical or adjoint-based derivatives may not be available for most real optimization problems, especially, when responses of a system are predicted by numerical simulations. The distributed-Gauss-Newton (DGN) optimization method performs quite efficiently and robustly for history-matching problems with multiple best matches. However, this method is not applicable for generic optimization problems, e.g., life-cycle production optimization or well location optimization.\
In this paper, we generalized the distribution techniques of the DGN optimization method and developed a new distributed quasi-Newton (DQN) optimization method that is applicable to generic optimization problems. It can handle generalized objective functions F(x,y(x))=f(x) with both explicit variables x and implicit variables, i.e., simulated responses, y(x). The partial derivatives of F(x,y) with respect to both x and y can be computed analytically, whereas the partial derivatives of y(x) with respect to x (sensitivity matrix) is estimated by applying the same efficient information sharing mechanism implemented in the DGN optimization method. An ensemble of quasi-Newton optimization tasks is distributed among multiple high-performance-computing (HPC) cluster nodes. The simulation results generated from one optimization task are shared with others by updating a common set of training data points, which records simulated responses of all simulation jobs. The sensitivity matrix at the current best solution of each optimization task is approximated by either the linear-interpolation (LI) method or the support-vector-regression (SVR) method, using some or all training data points. The gradient of the objective function is then analytically computed using its partial derivatives with respect to x and y and the estimated sensitivities of y with respect to x. The Hessian is updated using the quasi-Newton formulation. A new search point for each distributed optimization task is generated by solving a quasi-Newton trust-region subproblem for the next iteration.\
The proposed DQN method is first validated on a synthetic history matching problem and its performance is found to be comparable with the DGN optimizer. Then, the DQN method is tested on different optimization problems. For all test problems, the DQN method can find multiple optima of the objective function with reasonably small numbers of iterations (30 to 50). Compared to sequential model-based derivative-free optimization methods, the DQN method can reduce the computational cost, in terms of the number of simulations required for convergence, by a factor of 3 to 10.
BibTeX:
@inproceedings{Gao2020c,
  author = {G. Gao and Y. Wang and J. Vink and T. Wells and F. Saaf},
  title = {Distributed Quasi-Newton Derivative-Free Optimization Method for Optimization Problems with Multiple Local Optima},
  booktitle = {Conference Proceedings, ECMOR},
  year = {2020},
  pages = {1--22},
  url = {https://www.earthdoc.org/content/papers/10.3997/2214-4609.202035131}
}
Garcia-Gasulla M, Banchelli F, Peiro K, Ramirez-Gargallo G, Houzeaux G, Saïdi IBH, Tenaud C, Spisso I and Mantovani F (2020), "A generic performance analysis technique applied to different CFD methods for HPC"
Abstract: For complex engineering and scientific applications, Computational Fluid Dynamics simulations (CFD) require a huge amount of computational power. As such, it is of paramount importance to carefully assess the performance of CFD codes and to study them in depth for enabling optimization and portability. In this paper we study three complex CFD codes, OpenFOAM, Alya and CHORUS representing two numerical methods, namely the finite volume and finite element methods, on both structured and unstructured meshes. To all codes we apply a generic performance analysis method based on a set of metrics helping the code developer in spotting the critical points that can potentially limit the scalability of a parallel application. We show the root cause of the performance bottlenecks studying the three applications on the MareNostrum4 supercomputer. We conclude providing hints for improving the performance and the scalability of each application.
BibTeX:
@article{GarciaGasulla2020,
  author = {Marta Garcia-Gasulla and Fabio Banchelli and Kilian Peiro and Guillem Ramirez-Gargallo and Guillaume Houzeaux and Ismal Ben Hassan Saïdi and Christian Tenaud and Ivan Spisso and Filippo Mantovani},
  title = {A generic performance analysis technique applied to different CFD methods for HPC},
  year = {2020},
  url = {https://perso.limsi.fr/tenaud/Files/cfd-in-hpc_2020_accepted.pdf}
}
Garmanjani R (2020), "A note on the worst-case complexity of nonlinear stepsize control methods for convex smooth unconstrained optimization", Optimization., 10, 2020. , pp. 1-11. Informa UK Limited.
Abstract: In this paper, we analyse the worst-case complexity of nonlinear stepsize control (NSC) algorithms for solving convex smooth unconstrained optimization problems. We show that, to drive the norm of the gradient below some given positive 𝜖, such methods take at most O(𝜖^-1) iterations, which shows that the complexity bound for these methods is in parity with that of gradient descent methods for the same class of problems. As the NSC algorithm is a generalization of several methods such as trust-region and adaptive cubic with regularization methods, such a bound holds automatically for these methods as well.
BibTeX:
@article{Garmanjani2020,
  author = {R. Garmanjani},
  title = {A note on the worst-case complexity of nonlinear stepsize control methods for convex smooth unconstrained optimization},
  journal = {Optimization},
  publisher = {Informa UK Limited},
  year = {2020},
  pages = {1--11},
  doi = {10.1080/02331934.2020.1830088}
}
Georgakoudis G, Doerfert J, Laguna I and Scogland TRW (2020), "FAROS: A Framework to Analyze OpenMP Compilation Through Benchmarking and Compiler Optimization Analysis", In OpenMP: Portable Multi-Level Parallelism on Modern Systems. , pp. 3-17. Springer International Publishing.
Abstract: Compilers optimize OpenMP programs differently than their serial elision. Early outlining of parallel regions and invocation of parallel code via OpenMP runtime functions are two of the most profound differences. Understanding the interplay between compiler optimizations, OpenMP compilation, and application performance is hard and usually requires specialized benchmarks and compilation analysis tools.\
To this end, we present FAROS, an extensible framework to automate and structure the analysis of compiler optimization of OpenMP programs. FAROS provides a generic configuration interface to profile and analyze OpenMP applications with their native build configurations. Using FAROS on a set of 39 OpenMP programs, including HPC applications and kernels, we show that OpenMP compilation hinders optimization for the majority of programs. Comparing single-threaded OpenMP execution to its sequential counterpart, we observed slowdowns as much as 135.23%. In some cases, however, OpenMP compilation speeds up execution as much as 25.48% when OpenMP semantics help compiler optimization. Following analysis on compiler optimization reports enables us to pinpoint the reasons without in-depth knowledge of the compiler. The information can be used to improve compilers and also to bring performance on par through manual code refactoring.
BibTeX:
@incollection{Georgakoudis2020,
  author = {Giorgis Georgakoudis and Johannes Doerfert and Ignacio Laguna and Thomas R. W. Scogland},
  title = {FAROS: A Framework to Analyze OpenMP Compilation Through Benchmarking and Compiler Optimization Analysis},
  booktitle = {OpenMP: Portable Multi-Level Parallelism on Modern Systems},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {3--17},
  doi = {10.1007/978-3-030-58144-2_1}
}
Ginsbach P, Collie B and O'Boyle MFP (2020), "Automatically Harnessing Sparse Acceleration", January, 2020.
Abstract: Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code. In this paper, we develop a new approach based on our specification Language for implementers of Linear Algebra Computations (LiLAC). Rather than requiring the application developer to (re)write every program for a given library, the burden is shifted to a one-off description by the library implementer. The LiLAC-enabled compiler uses this to insert appropriate library routines without source code changes. LiLAC provides automatic data marshaling, maintaining state between calls and minimizing data transfers. Appropriate places for library insertion are detected in compiler intermediate representation, independent of source languages. We evaluated on large-scale scientific applications written in FORTRAN; standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across heterogeneous platforms, applications and data sets we show speedups of 1.1× to over 10× without user intervention.
BibTeX:
@article{Ginsbach2020,
  author = {Philip Ginsbach and Bruce Collie and Michael F. P. O'Boyle},
  title = {Automatically Harnessing Sparse Acceleration},
  year = {2020},
  doi = {10.1145/3377555.3377893}
}
Giraud L, Rüde U and Stals L (2020), "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations". Thesis at: Dagstuhl Seminar.
Abstract: This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 hours on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large scale simulation?\ Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.\ More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar.\ The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications, and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
BibTeX:
@techreport{Giraud2020,
  author = {Luc Giraud and Ulrich Rüde and Linda Stals},
  title = {Resiliency in Numerical Algorithm Design for Extreme Scale Simulations},
  school = {Dagstuhl Seminar},
  year = {2020},
  url = {https://drops.dagstuhl.de/opus/volltexte/2020/13429/pdf/dagrep_v010_i003_p001_20101.pdf}
}
Goebel F, Anzt H, Cojean T, Flegar G and Quintana-Ortí ES (2020), "Multiprecision Block-Jacobi for Iterative Triangular Solves", In Euro-Par 2020: Parallel Processing. , pp. 546-560. Springer International Publishing.
Abstract: Recent research efforts have shown that Jacobi and block-Jacobi relaxation methods can be used as an effective and highly parallel approach for the solution of sparse triangular linear systems arising in the application of ILU-type preconditioners. Simultaneously, a few independent works have focused on designing efficient high performance adaptive-precision block-Jacobi preconditioning (block-diagonal scaling), in the context of the iterative solution of sparse linear systems, on manycore architectures. In this paper, we bridge the gap between relaxation methods based on regular splittings and preconditioners by demonstrating that iterative refinement can be leveraged to construct a relaxation method from the preconditioner. In addition, we exploit this insight to construct a highly-efficient sparse triangular system solver for graphics processors that combines iterative refinement with the block-Jacobi preconditioner available in the Ginkgo library.
BibTeX:
@incollection{Goebel2020,
  author = {Fritz Goebel and Hartwig Anzt and Terry Cojean and Goran Flegar and Enrique S. Quintana-Ortí},
  title = {Multiprecision Block-Jacobi for Iterative Triangular Solves},
  booktitle = {Euro-Par 2020: Parallel Processing},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {546--560},
  doi = {10.1007/978-3-030-57675-2_34}
}
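The relaxation idea at the heart of this line of work is simple to state: applied to a triangular system, the Jacobi iteration has a nilpotent iteration matrix and therefore terminates exactly after at most n sweeps, usually far sooner for ILU-type factors. Below is a plain uniform-precision MATLAB sketch of that idea (mine, not Ginkgo's adaptive-precision block-Jacobi implementation).
function x = jacobi_trisolve(L, b, tol, maxit)
% Jacobi relaxation for a sparse lower-triangular system L*x = b.
d = diag(L);                       % diagonal of L, assumed nonzero
x = b ./ d;                        % start from x0 = D^{-1} b
for it = 1:maxit
    r = b - L * x;
    if norm(r) <= tol * norm(b)
        break
    end
    x = x + r ./ d;                % x_{k+1} = x_k + D^{-1} (b - L x_k)
end
end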
Goik D and Banaś K (2020), "A Block Preconditioner for Scalable Large Scale Finite Element Incompressible Flow Simulations", In Lecture Notes in Computer Science. , pp. 199-211. Springer International Publishing.
Abstract: We present a block preconditioner, based on the algebraic multigrid method, for solving systems of linear equations, that arise in incompressible flow simulations performed by the stabilized finite element method. We select a set of adjustable parameters for the preconditioner and show how to tune the parameters in order to obtain fast convergence of the standard GMRES solver in which the preconditioner is employed. Additionally, we show some details of the parallel implementation of the preconditioner and the achieved scalability of the solver in large scale parallel incompressible flow simulations.
BibTeX:
@incollection{Goik2020,
  author = {Damian Goik and Krzysztof Banaś},
  title = {A Block Preconditioner for Scalable Large Scale Finite Element Incompressible Flow Simulations},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {199--211},
  doi = {10.1007/978-3-030-50420-5_15}
}
Goli M, Narasimhan K, Reyes R, Tracy B, Soutar D and Georgiev S (2020), "Towards Cross-Platform Performance Portability of DNN Models using SYCL", Proceedings of the 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC.
Abstract: The incoming deployment of Exascale platforms with a myriad of different architectures and co-processors has prompted the need to provide a software ecosystem based on open standards that can simplify maintaining HPC applications on different hardware. Applications written for a particular platform should be portable to a different one, ensuring performance is as close to the peak as possible. However, it is not expected that key performance routines on relevant HPC applications will be performance portable as is, especially for common building blocks such as BLAS or DNN. The oneAPI initiative aims to tackle this problem by combining a programming model, SYCL, with a set of interfaces for common building blocks that can be optimized for different hardware vendors. In particular, oneAPI includes the oneDNN performance library, which contains building blocks for deep learning applications and frameworks. By using the SYCL programming model, it can integrate easily with existing SYCL and C++ applications, sharing data and executing collaboratively on devices with the rest of the application. In this paper, we introduce a cuDNN backend for oneDNN, which allows running oneAPI applications on NVIDIA hardware taking advantage of existing building blocks from the CUDA ecosystem. We implement relevant neural networks (ResNet-50 and VGG16) on native CUDA and also a version of oneAPI with a CUDA backend, and demonstrate that performance portability can be achieved by leveraging existing building blocks for the target hardware.
BibTeX:
@article{Goli2020,
  author = {Mehdi Goli and Kumudha Narasimhan and Ruyman Reyes and Ben Tracy and Daniel Soutar and Svetlozar Georgiev},
  title = {Towards Cross-Platform Performance Portability of DNN Models using SYCL},
  journal = {Proceedings of the 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC},
  year = {2020}
}
Gómez Crespo C, Casas Guix M, Mantovani F and Focht E (2020), "Optimizing sparse matrix-vector multiplication in NEC SX-Aurora vector engine". Thesis at: Barcelona Supercomputing Centre.
Abstract: Sparse Matrix-Vector multiplication (SpMV) is an essential piece of code used in many High Performance Computing (HPC) applications. As previous literature shows, achieving efficient vectorization and performance in modern multi-core systems is nothing straightforward. It is important then to revisit the current state-of-the-art matrix formats and optimizations to be able to deliver high performance in long vector architectures. In this tech-report, we describe how to develop an efficient implementation that achieves high throughput in the NEC Vector Engine: a 256 element-long vector architecture. Combining several pre-processing and kernel optimizations we obtain an average 12% improvement over a base SELLC-σ implementation on a heterogeneous set of 24 matrices.
BibTeX:
@techreport{GomezCrespo2020,
  author = {Gómez Crespo, Constantino and Casas Guix, Marc and Mantovani, Filippo and Focht, Erich},
  title = {Optimizing sparse matrix-vector multiplication in NEC SX-Aurora vector engine},
  school = {Barcelona Supercomputing Centre},
  year = {2020},
  url = {http://hdl.handle.net/2117/192586}
}
Goncalves MM, Lamb IP, Rech P, Brum RM and Azambuja JR (2020), "Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy", IEEE Transactions on Nuclear Science. , pp. 1-1.
Abstract: The high computing power of GPUs makes them attractive for safety-critical applications, where reliability is a major concern. This paper uses an approximate computing perspective to relax application accuracy in order to improve selective fault tolerance techniques. Our approach first assesses the vulnerability of a Kepler GPU to transient effects through a neutron beam experiment. Then, it performs a fault injection campaign to identify the most critical registers and relaxes result accuracy. Finally, it uses acquired data to improve selective fault tolerance techniques in terms of occupation and performance. Results show that it was possible to improve the GPU register file's reliability in an average of 71.6% by relaxing application accuracy and, when compared to selective hardening techniques, it was able to reduce replicated registers by an average of 41.4%, while maintaining 100% fault coverage.
BibTeX:
@article{Goncalves2020,
  author = {M. M. Goncalves and I. P. Lamb and P. Rech and R. M. Brum and J. R. Azambuja},
  title = {Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy},
  journal = {IEEE Transactions on Nuclear Science},
  year = {2020},
  pages = {1-1},
  doi = {10.1109/TNS.2020.2982162}
}
Gottesbüren L, Heuer T, Sanders P and Schlag S (2020), "Scalable Shared-Memory Hypergraph Partitioning", October, 2020.
Abstract: Hypergraph partitioning is an important preprocessing step for optimizing data placement and minimizing communication volumes in high-performance computing applications. To cope with ever growing problem sizes, it has become increasingly important to develop fast parallel partitioning algorithms whose solution quality is competitive with existing sequential algorithms. To this end, we present Mt-KaHyPar, the first shared-memory multilevel hypergraph partitioner with parallel implementations of many techniques used by the sequential, high-quality partitioning systems: a parallel coarsening algorithm that uses parallel community detection as guidance, initial partitioning via parallel recursive bipartitioning with work-stealing, a scalable label propagation refinement algorithm, and the first fully-parallel direct k-way formulation of the classical FM algorithm. Experiments performed on a large benchmark set of instances from various application domains demonstrate the scalability and effectiveness of our approach. With 64 cores, we observe self-relative speedups of up to 51 and a harmonic mean speedup of 23.5. In terms of solution quality, we outperform the distributed hypergraph partitioner Zoltan on 95% of the instances while also being a factor of 2.1 faster. With just four cores, Mt-KaHyPar is also slightly faster than the fastest sequential multilevel partitioner PaToH while producing better solutions on 83% of all instances. The sequential high-quality partitioner KaHyPar still finds better solutions than our parallel approach, especially when using max-flow-based refinement. This, however, comes at the cost of considerably longer running times.
BibTeX:
@article{Gottesbueren2020,
  author = {Lars Gottesbüren and Tobias Heuer and Peter Sanders and Sebastian Schlag},
  title = {Scalable Shared-Memory Hypergraph Partitioning},
  year = {2020}
}
Gou C, Zoobi AA, Benoit A, Faverge M, Marchal L, Pichon G and Ramet P (2020), "Improving Mapping for Sparse Direct Solvers", In Euro-Par 2020: Parallel Processing. , pp. 167-182. Springer International Publishing.
Abstract: In order to express parallelism, parallel sparse direct solvers take advantage of the elimination tree to exhibit tree-shaped task graphs, where nodes represent computational tasks and edges represent data dependencies. One of the pre-processing stages of sparse direct solvers consists of mapping computational resources (processors) to these tasks. The objective is to minimize the factorization time by exhibiting good data locality and load balancing. The proportional mapping technique is a widely used approach to solve this resource-allocation problem. It achieves good data locality by assigning the same processors to large parts of the elimination tree. However, it may limit load balancing in some cases. In this paper, we propose a dynamic mapping algorithm based on proportional mapping. This new approach, named Steal, relaxes the data locality criterion to improve load balancing. In order to validate the newly introduced method, we perform extensive experiments on the PaStiX sparse direct solver. It demonstrates that our algorithm enables better static scheduling of the numerical factorization while keeping good data locality.
BibTeX:
@incollection{Gou2020,
  author = {Changjiang Gou and Ali Al Zoobi and Anne Benoit and Mathieu Faverge and Loris Marchal and Grégoire Pichon and Pierre Ramet},
  title = {Improving Mapping for Sparse Direct Solvers},
  booktitle = {Euro-Par 2020: Parallel Processing},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {167--182},
  doi = {10.1007/978-3-030-57675-2_11}
}
Gower RM, Sebbouh O and Loizou N (2020), "SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation", June, 2020.
Abstract: We provide several convergence theorems for SGD for two large classes of structured non-convex functions: (i) the Quasar (Strongly) Convex functions and (ii) the functions satisfying the Polyak-Lojasiewicz condition. Our analysis relies on the Expected Residual condition which we show is a strictly weaker assumption as compared to previously used growth conditions, expected smoothness or bounded variance assumptions. We provide theoretical guarantees for the convergence of SGD for different step size selections including constant, decreasing and the recently proposed stochastic Polyak step size. In addition, all of our analysis holds for the arbitrary sampling paradigm, and as such, we are able to give insights into the complexity of minibatching and determine an optimal minibatch size. In particular we recover the best known convergence rates of full gradient descent and single element sampling SGD as a special case. Finally, we show that for models that interpolate the training data, we can dispense of our Expected Residual condition and give state-of-the-art results in this setting.
BibTeX:
@article{Gower2020,
  author = {Robert M. Gower and Othmane Sebbouh and Nicolas Loizou},
  title = {SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation},
  year = {2020}
}
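The stochastic Polyak step size mentioned in this abstract is simple enough to sketch. Below is a minimal MATLAB illustration on a synthetic interpolating least-squares problem (so each per-sample optimal value f_i^* is zero); the problem, the constant c and the variable names are my own choices, not the paper's.
rng('default');
n = 200; d = 20;
A = randn(n, d); xstar = randn(d, 1);
b = A * xstar;                              % consistent data, so each f_i^* = 0
x = zeros(d, 1);
c = 0.5;                                    % damping constant in the step size
for iter = 1:5000
    i = randi(n);
    ri = A(i, :) * x - b(i);                % residual of the sampled equation
    fi = 0.5 * ri^2;                        % sampled loss f_i(x)
    gi = ri * A(i, :)';                     % gradient of f_i at x
    gamma = fi / (c * (norm(gi)^2 + eps));  % stochastic Polyak step size
    x = x - gamma * gi;
end
fprintf('Distance to the interpolating solution: %.2e\n', norm(x - xstar));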
Gower RM, Schmidt M, Bach F and Richtarik P (2020), "Variance-Reduced Methods for Machine Learning", October, 2020.
Abstract: Stochastic optimization lies at the heart of machine learning, and its cornerstone is stochastic gradient descent (SGD), a method introduced over 60 years ago. The last 8 years have seen an exciting new development: variance reduction (VR) for stochastic optimization methods. These VR methods excel in settings where more than one pass through the training data is allowed, achieving a faster convergence than SGD in theory as well as practice. These speedups underline the surge of interest in VR methods and the fast-growing body of work on this topic. This review covers the key principles and main developments behind VR methods for optimization with finite data sets and is aimed at non-expert readers. We focus mainly on the convex setting, and leave pointers to readers interested in extensions for minimizing non-convex functions.
BibTeX:
@article{Gower2020a,
  author = {Robert M. Gower and Mark Schmidt and Francis Bach and Peter Richtarik},
  title = {Variance-Reduced Methods for Machine Learning},
  year = {2020}
}
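To make the variance-reduction idea concrete, here is a minimal MATLAB sketch of SVRG, one of the classical VR methods a review like this covers, applied to a synthetic ridge-regression problem. It illustrates the general technique only; it is not code from the paper.
rng('default');
n = 500; d = 30; lambda = 1e-2;
A = randn(n, d); b = A * randn(d, 1) + 0.1 * randn(n, 1);
grad_i = @(x, i) (A(i, :) * x - b(i)) * A(i, :)' + lambda * x;  % per-sample gradient
full_grad = @(x) A' * (A * x - b) / n + lambda * x;             % full gradient
x = zeros(d, 1);
eta = 1e-3;                                 % fixed step size
for epoch = 1:50
    xs = x;                                 % snapshot point
    mu = full_grad(xs);                     % full gradient at the snapshot
    for t = 1:n
        i = randi(n);
        v = grad_i(x, i) - grad_i(xs, i) + mu;   % variance-reduced gradient estimate
        x = x - eta * v;
    end
end
fprintf('Norm of the full gradient after SVRG: %.2e\n', norm(full_grad(x)));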
Gratien J-M (2020), "Introducing multi-level parallelism, at coarse, fine and instruction level to enhance the performance of iterative solvers for large sparse linear systems on Multi- and Many-core architecture", In Proceedings of the 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing.
Abstract: With the evolution of High Performance Computing, multi-core and many-core systems are now a common feature of new hardware architectures. The introduction of a very large number of cores at the processor level is challenging because it requires handling multi-level parallelism at various levels, either coarse or fine, to fully take advantage of the offered computing power. The induced programming effort can be addressed with parallel programming models based on the data flow model and the task programming paradigm [1]. To do so, many of the standard numerical algorithms must be revisited, as they cannot be easily parallelized at the finest levels. Iterative linear solvers are a key part of petroleum reservoir simulation as they can represent up to 80% of the total computing time. In these algorithms, the standard preconditioning methods for large, sparse and unstructured matrices – such as Incomplete LU Factorization (ILU) or Algebraic Multigrid (AMG) – fail to scale on shared-memory architectures with a large number of cores. In this paper we reconsider preconditioning algorithms to better introduce multi-level parallelism at the coarse level with MPI, at the fine level with threads, and at the instruction level to enable SIMD optimizations. This paper illustrates how we enhance the implementation of preconditioners like the multilevel domain decomposition (DDML) preconditioners [2], based on the popular Additive Schwarz Method (ASM), or the classical ILU0 preconditioner with the fine-grained parallel fixed-point variant presented in [3]. Our approach is validated on linear systems extracted from realistic petroleum reservoir simulations. The robustness of the preconditioners is tested with respect to the data heterogeneities of the study cases. We evaluate the extensibility of our implementation regarding the model sizes and its scalability regarding the large number of cores provided by new KNL processors or multi-node clusters.
BibTeX:
@inproceedings{Gratien2020,
  author = {Jean-Marc Gratien},
  title = {Introducing multi-level parallelism, at coarse, fine and instruction level to enhance the performance of iterative solvers for large sparse linear systems on Multi- and Many-core architecture},
  booktitle = {Proceedings of the 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing},
  year = {2020}
}
Gratton S, Simon E, Titley-Peloquin D and Toint PL (2020), "Minimizing convex quadratics with variable precision conjugate gradients", Numerical Linear Algebra with Applications., 10, 2020. Wiley.
Abstract: We investigate the method of conjugate gradients, exploiting inaccurate matrix-vector products, for the solution of convex quadratic optimization problems. Theoretical performance bounds are derived, and the necessary quantities occurring in the theoretical bounds are estimated, leading to a practical algorithm. Numerical experiments suggest that this approach has significant potential, including in the steadily more important context of multiprecision computations.
BibTeX:
@article{Gratton2020,
  author = {Serge Gratton and Ehouarn Simon and David Titley-Peloquin and Philippe L. Toint},
  title = {Minimizing convex quadratics with variable precision conjugate gradients},
  journal = {Numerical Linear Algebra with Applications},
  publisher = {Wiley},
  year = {2020},
  doi = {10.1002/nla.2337}
}
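As a rough illustration of the setting (conjugate gradients driven by an inexact, reduced-precision matrix-vector product), the MATLAB sketch below performs the product in single precision and everything else in double. This is not the authors' variable-precision scheme, only a baseline of the idea on a small synthetic SPD system.
rng('default');
n = 400;
A = sprandsym(n, 0.05, 0.1, 1);             % random sparse SPD test matrix
b = randn(n, 1);
As = single(full(A));                       % operator stored in single precision
matvec = @(p) double(As * single(p));       % inexact matrix-vector product
x = zeros(n, 1);
r = b; p = r;
rho = r' * r;
for k = 1:200
    Ap = matvec(p);
    alpha = rho / (p' * Ap);
    x = x + alpha * p;
    r = r - alpha * Ap;
    rho_new = r' * r;
    if sqrt(rho_new) < 1e-6 * norm(b), break; end
    p = r + (rho_new / rho) * p;
    rho = rho_new;
end
fprintf('CG stopped after %d iterations, true residual %.2e\n', k, norm(b - A * x));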
Gross JC and Parks GT (2020), "Optimization by moving ridge functions: Derivative-free optimization for computationally intensive functions", July, 2020.
Abstract: A novel derivative-free algorithm, optimization by moving ridge functions (OMoRF), for unconstrained and bound-constrained optimization is presented. This algorithm couples trust region methodologies with output-based dimension reduction to accelerate convergence of model-based optimization strategies. The dimension-reducing subspace is updated as the trust region moves through the design space, allowing OMoRF to be applied to functions with no known global low-dimensional structure. Furthermore, its low computational requirement allows it to make rapid progress when optimizing high-dimensional functions. Its performance is examined on a set of test problems of moderate to high dimension and a high-dimensional design optimization problem. The results show that OMoRF compares favourably to other common derivative-free optimization methods, particularly when very few function evaluations are available.
BibTeX:
@article{Gross2020,
  author = {James C. Gross and Geoffrey T. Parks},
  title = {Optimization by moving ridge functions: Derivative-free optimization for computationally intensive functions},
  year = {2020}
}
Grossman M, Pritchard H, Poole S and Sarkar V (2020), "HOOVER: Leveraging OpenSHMEM for High Performance, Flexible Streaming Graph Applications", In Proceedings of the 2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X., November, 2020. IEEE.
Abstract: As the adoption of streaming graph applications increases in the finance, defense, social, and other industries so does the size, velocity, irregularity, and complexity of these streaming graphs. Most existing frameworks for processing streaming, dynamic graphs are too limited in scalability or functionality to efficiently handle these increasingly massive graphs. Many frameworks are either built for shared memory platforms only (limiting the size of the graph that can be stored) or for distributed platforms, but run on slow, high overhead, interpreted, and bulk synchronous platforms. This paper introduces HOOVER, a high performance streaming graph modeling and analysis framework built from scratch to scale on high performance systems and extremely dynamic graphs. HOOVER offers similar APIs to previous streaming graph frameworks, but sits on top of a high performance runtime system designed for modern supercomputers. HOOVER leverages an eventually-consistent consistency model to improve scalability, and offers a number of unique features to users. On micro-benchmarks, HOOVER is shown to be comparable or faster than existing high performance and distributed graph frameworks. Using mini-apps, we also show that HOOVER easily scales to 2,048 PEs on more realistic applications.
BibTeX:
@inproceedings{Grossman2020,
  author = {Max Grossman and Howard Pritchard and Steve Poole and Vivek Sarkar},
  title = {HOOVER: Leveraging OpenSHMEM for High Performance, Flexible Streaming Graph Applications},
  booktitle = {Proceedings of the 2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/pawatm51920.2020.00010}
}
Grützmacher T, Cojean T, Flegar G, Anzt H and Quintana-Ortí ES (2020), "Acceleration of PageRank with Customized Precision Based on Mantissa Segmentation", ACM Transactions on Parallel Computing., 3, 2020. Vol. 7(1), pp. 1-19. Association for Computing Machinery (ACM).
Abstract: We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access at different levels of accuracy, prevents overflow/underflow by preserving the IEEE 754 double-precision exponent, and efficiently avoids data duplication, since all bits of the original IEEE 754 double-precision mantissa are preserved in memory, but re-organized for efficient reduced-precision access. With this approach, the truncated values (omitting significand bits), as well as the original IEEE double-precision values, can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) expose that, compared with a standard IEEE double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.
BibTeX:
@article{Gruetzmacher2020,
  author = {Thomas Grützmacher and Terry Cojean and Goran Flegar and Hartwig Anzt and Enrique S. Quintana-Ortí},
  title = {Acceleration of PageRank with Customized Precision Based on Mantissa Segmentation},
  journal = {ACM Transactions on Parallel Computing},
  publisher = {Association for Computing Machinery (ACM)},
  year = {2020},
  volume = {7},
  number = {1},
  pages = {1--19},
  doi = {10.1145/3380934}
}
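For orientation, the kernel being accelerated here is the standard PageRank power iteration, which fits in a few lines of MATLAB. The sketch below uses ordinary double precision and a random synthetic graph; the segmented-mantissa storage that is the paper's contribution is not modelled.
rng('default');
n = 1000;
G = double(sprand(n, n, 5 / n) > 0);        % random sparse 0/1 adjacency matrix
outdeg = full(sum(G, 2));
outdeg(outdeg == 0) = 1;                    % guard dangling nodes against division by zero
P = spdiags(1 ./ outdeg, 0, n, n) * G;      % (sub)stochastic transition matrix
alpha = 0.85;                               % damping factor
x = ones(n, 1) / n;
for k = 1:200
    y = alpha * (P' * x);
    xnew = y + (1 - sum(y)) / n;            % teleportation plus redistributed dangling mass
    if norm(xnew - x, 1) < 1e-10, x = xnew; break; end
    x = xnew;
end
fprintf('PageRank power iteration stopped after %d iterations\n', k);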
Gryazin YA and Spielman RB (2020), "Parallel Direct Regularized Solver for Power Circuit Applications", In Advances in Intelligent Systems and Computing., 11, 2020. , pp. 193-204. Springer International Publishing.
Abstract: In this paper, a new direct parallel linear solver for a variety of pulsed power applications is presented. This algorithm is the core of the electrical circuit simulator Screamer considered in our previous publications. The main idea of the underlying algorithm is to use graph partitioning to break the problem tree into a series of branches that can be solved in parallel. Then this division is used in the recursive elimination process to guarantee the high efficiency of the direct method. The graph partitioning in this problem is straightforward since it corresponds to the branch structure of the underlying circuit. The partitioning method naturally leads to the parallel implementation of the factorization and solution steps. The numerical results of test problems confirm the high efficiency of the suggested direct algorithm.
BibTeX:
@incollection{Gryazin2020,
  author = {Yury A. Gryazin and Rick B. Spielman},
  title = {Parallel Direct Regularized Solver for Power Circuit Applications},
  booktitle = {Advances in Intelligent Systems and Computing},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {193--204},
  doi = {10.1007/978-3-030-63089-8_12}
}
Gu Z, Moreira J, Edelsohn D and Azad A (2020), "Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking", February, 2020.
Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. It is well known that SpGEMM is a memory-bound operation, and its peak performance is expected to be bound by the memory bandwidth. Yet, existing algorithms fail to saturate the memory bandwidth, resulting in suboptimal performance under the Roofline model. In this paper we characterize existing SpGEMM algorithms based on their memory access patterns and develop practical lower and upper bounds for SpGEMM performance. We then develop an SpGEMM algorithm based on outer product matrix multiplication. The newly developed algorithm called PB-SpGEMM saturates memory bandwidth by using the propagation blocking technique and by performing in-cache sorting and merging. For many practical matrices, PB-SpGEMM runs 20%-50% faster than the state-of-the-art heap and hash SpGEMM algorithms on modern multicore processors. Most importantly, PB-SpGEMM attains performance predicted by the Roofline model, and its performance remains stable with respect to matrix size and sparsity.
BibTeX:
@article{Gu2020,
  author = {Zhixiang Gu and Jose Moreira and David Edelsohn and Ariful Azad},
  title = {Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking},
  year = {2020}
}
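The outer-product formulation that PB-SpGEMM starts from can be written naively in MATLAB as a sum of rank-1 contributions, one per column of A. The sketch below only illustrates that formulation; it has none of the propagation blocking, binning or in-cache merging that gives the paper its performance.
rng('default');
A = sprand(500, 400, 0.01);
B = sprand(400, 600, 0.01);
C = sparse(size(A, 1), size(B, 2));
for k = 1:size(A, 2)
    C = C + A(:, k) * B(k, :);              % rank-1 (outer-product) contribution of column k
end
fprintf('Maximum deviation from the built-in product: %.2e\n', ...
    full(max(max(abs(C - A * B)))));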
Güttel S, Kressner D and Lund K (2020), "Limited-memory polynomial methods for large-scale matrix functions", February, 2020.
Abstract: Matrix functions are a central topic of linear algebra, and problems requiring their numerical approximation appear increasingly often in scientific computing. We review various limited-memory methods for the approximation of the action of a large-scale matrix function on a vector. Emphasis is put on polynomial methods, whose memory requirements are known or prescribed a priori. Methods based on explicit polynomial approximation or interpolation, as well as restarted Arnoldi methods, are treated in detail. An overview of existing software is also given, as well as a discussion of challenging open problems.
BibTeX:
@article{Guettel2020,
  author = {Stefan Güttel and Daniel Kressner and Kathryn Lund},
  title = {Limited-memory polynomial methods for large-scale matrix functions},
  year = {2020}
}
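The simplest limited-memory polynomial method is a truncated Taylor approximation of exp(A)b that needs only matrix-vector products and two work vectors. The MATLAB sketch below is a toy instance with an artificially scaled matrix so that the series converges quickly; the survey covers far more robust choices (Chebyshev expansions, a priori degree selection, restarted Arnoldi).
rng('default');
n = 300;
A = sprandsym(n, 0.05);
A = A / (2 * normest(A));                   % scale so the Taylor series converges quickly
b = randn(n, 1);
y = b;                                      % running polynomial approximation of exp(A)*b
t = b;                                      % current Taylor term A^k b / k!
for k = 1:25
    t = A * t / k;
    y = y + t;
end
exact = expm(full(A)) * b;                  % dense reference, affordable at this size
fprintf('Relative error of the truncated Taylor series: %.2e\n', norm(y - exact) / norm(exact));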
Guidi G, Selvitopi O, Ellis M, Oliker L, Yelick K and Buluc A (2020), "Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly", October, 2020.
Abstract: One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3× for the human genome and 1.5-1.9× for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3× for the human genome and 18-29× for the C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
BibTeX:
@article{Guidi2020,
  author = {Giulia Guidi and Oguz Selvitopi and Marquita Ellis and Leonid Oliker and Katherine Yelick and Aydin Buluc},
  title = {Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly},
  year = {2020}
}
Guidi G, Ellis M, Buluc A, Yelick K and Culler D (2020), "10 Years Later: Cloud Computing is Closing the Performance Gap", November, 2020.
Abstract: Large scale modeling and simulation problems, from nanoscale materials to universe-scale cosmology, have in the past used the massive computing resources of High-Performance Computing (HPC) systems. Over the last decade, cloud computing has gained popularity for business applications and increasingly for computationally intensive machine learning problems. Despite the prolific literature, the question remains open whether cloud computing can provide HPC-competitive performance for a wide range of scientific applications. The answer to this question is crucial in guiding the design of future systems and providing access to high-performance resources to a broadened community. Here we present a multi-level approach to identifying the performance gap between HPC and cloud computing and to isolate several variables that contribute to this gap by dividing our experiments into (i) hardware and system microbenchmarks and (ii) user applications. Our results show that today's high-end cloud computing can deliver HPC-like performance - at least at modest scales - not only for computationally intensive applications, but also for memory- and communication-intensive applications, thanks to the high-speed memory systems and interconnects and dedicated batch scheduling now available on some cloud platforms.
BibTeX:
@article{Guidi2020a,
  author = {Giulia Guidi and Marquita Ellis and Aydin Buluc and Katherine Yelick and David Culler},
  title = {10 Years Later: Cloud Computing is Closing the Performance Gap},
  year = {2020}
}
Guler B, Avestimehr S and Ortega A (2020), "TACC: Topology-Aware Coded Computing for Distributed Graph Processing", IEEE Transactions on Signal and Information Processing over Networks. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: This paper proposes a coded distributed graph processing framework to alleviate the communication bottleneck in large-scale distributed graph processing. In particular, we propose a topology-aware coded computing (TACC) algorithm that has two novel salient features: (i) a topology-aware graph allocation strategy, and (ii) a coded aggregation scheme that combines the intermediate computations for graph processes while constructing coded messages. The proposed setup results in a trade-off between computation and communication, in that increasing the computation load at the distributed parties can in turn reduce the communication load. We demonstrate the effectiveness of the TACC algorithm by comparing the communication load with existing setups on both Erdos-Renyi and Barabasi-Albert type random graphs, as well as the real-world Google web graph for PageRank computations. In particular, we show that the proposed coding strategy can lead to up to 82% reduction in communication load and up to 46% reduction in overall execution time, when compared to the state-of-the-art and implemented on the Amazon EC2 cloud compute platform.
BibTeX:
@article{Guler2020,
  author = {Basak Guler and Salman Avestimehr and Antonio Ortega},
  title = {TACC: Topology-Aware Coded Computing for Distributed Graph Processing},
  journal = {IEEE Transactions on Signal and Information Processing over Networks},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  doi = {10.1109/tsipn.2020.2998223}
}
Guo H and Rubio-González C (2020), "Efficient Generation of Error-Inducing Floating-Point Inputs via Symbolic Execution", In Proceedings of the 42nd IEEE/ACM International Conference on Software Engineering.
Abstract: Floating point is widely used in software to emulate arithmetic over reals. Unfortunately, floating point leads to rounding errors that propagate and accumulate during execution. Generating inputs to maximize the numerical error is critical when evaluating the accuracy of floating-point code. In this paper, we formulate the problem of generating high error-inducing floating-point inputs as a code coverage maximization problem solved using symbolic execution. Specifically, we define inaccuracy checks to detect large precision loss and cancellation. We inject these checks at strategic program locations to construct specialized branches that, when covered by a given input, are likely to lead to large errors in the result. We apply symbolic execution to generate inputs that exercise these specialized branches, and describe optimizations that make our approach practical. We implement a tool named FPGen and present an evaluation on 21 numerical programs including matrix computation and statistics libraries. We show that FPGen exposes errors for 20 of these programs and triggers errors that are, on average, over 2 orders of magnitude larger than the state of the art.
BibTeX:
@inproceedings{Guo2020,
  author = {Hui Guo and Cindy Rubio-González},
  title = {Efficient Generation of Error-Inducing Floating-Point Inputs via Symbolic Execution},
  booktitle = {Proceedings of the 42nd IEEE/ACM International Conference on Software Engineering},
  year = {2020},
  url = {https://hguo15.github.io/huiguo.github.io/files/fpgen-icse20.pdf}
}
Guo M and Wang S (2020), "Quantum Computing for Solving Spatial Optimization Problems", In Geotechnologies and the Environment. , pp. 97-113. Springer International Publishing.
Abstract: Ever since Shor's quantum factoring algorithm was developed, quantum computing has been pursued as a promising and powerful approach to solving many computationally complex problems such as combinatorial optimization and machine learning. As an important quantum computing approach, quantum annealing (QA) has received considerable attention. Extensive research has shown that QA, exploiting quantum-mechanical effects such as tunneling, entanglement and superposition, could be much more efficient in solving hard combinatorial optimization problems than its classical counterpart -- simulated annealing. Recent advances in quantum annealing hardware open the possibility of empirical testing of QA against the most challenging computational problems arising in geospatial applications. This chapter demonstrates how to employ QA to solve NP-hard spatial optimization problems through an illustrative example of programming a p-median model and a case study on spatial supply chain optimization. The research findings also address the short- and long-term potential of quantum computing in the future development of high-performance computing for geospatial applications.
BibTeX:
@incollection{Guo2020a,
  author = {Mengyu Guo and Shaowen Wang},
  title = {Quantum Computing for Solving Spatial Optimization Problems},
  booktitle = {Geotechnologies and the Environment},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {97--113},
  doi = {10.1007/978-3-030-47998-5_6}
}
Guo H, Laguna I and Rubio-González C (2020), "pLiner: Isolating Lines of Floating-Point Code for Compiler-Induced Variability", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 680-693. IEEE Computer Society.
Abstract: Scientific applications are often impacted by numerical inconsistencies when using different compilers or when a compiler is used with different optimization levels; such inconsistencies hinder reproducibility and can be hard to diagnose. We present PLINER, a tool to automatically pinpoint code lines that trigger compiler-induced variability. PLINER uses a novel approach to enhance floating-point precision at different levels of code granularity, and performs a guided search to identify locations affected by numerical inconsistencies. We demonstrate PLINER on a real-world numerical inconsistency that required weeks to diagnose, which PLINER isolates in minutes. We also evaluate PLINER on 100 synthetic programs, and the NAS Parallel Benchmarks (NPB). On the synthetic programs, PLINER detects the affected lines of code 87% of the time while the state-of-the-art approach only detects the affected lines 6% of the time. Furthermore, PLINER successfully isolates all numerical inconsistencies found in the NPB.
BibTeX:
@inproceedings{Guo2020b,
  author = {H. Guo and I. Laguna and C. Rubio-González},
  title = {pLiner: Isolating Lines of Floating-Point Code for Compiler-Induced Variability},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {680--693},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00053},
  doi = {10.1109/SC41405.2020.00053}
}
Gusmeroli N, Hrga T, Lužar B, Povh J, Siebenhofer M and Wiegele A (2020), "BiqBin: a parallel branch-and-bound solver for binary quadratic problems with linear constraints", September, 2020.
Abstract: We present BiqBin, an exact solver for linearly constrained binary quadratic problems. Our approach is based on an exact penalty method to first efficiently transform the original problem into an instance of Max-Cut, and then to solve the Max-Cut problem by a branch-and-bound algorithm. All the main ingredients are carefully developed using new semidefinite programming relaxations obtained by strengthening the existing relaxations with a set of hypermetric inequalities, applying the bundle method as the bounding routine and using new strategies for exploring the branch-and-bound tree. Furthermore, an efficient C implementation of a sequential and a parallel branch-and-bound algorithm is presented. The latter is based on a load coordinator-worker scheme using MPI for multi-node parallelization and is evaluated on a high-performance computer. The new solver is benchmarked against BiqCrunch, GUROBI, and SCIP on four families of (linearly constrained) binary quadratic problems. Numerical results demonstrate that BiqBin is a highly competitive solver. The serial version outperforms the other three solvers on the majority of the benchmark instances. We also evaluate the parallel solver and show that it has good scaling properties. The general audience can use it as an on-line service available at http://www.biqbin.eu.
BibTeX:
@article{Gusmeroli2020,
  author = {Nicolò Gusmeroli and Timotej Hrga and Borut Lužar and Janez Povh and Melanie Siebenhofer and Angelika Wiegele},
  title = {BiqBin: a parallel branch-and-bound solver for binary quadratic problems with linear constraints},
  year = {2020}
}
Haidar A, Bayraktar H, Tomov S, Dongarra J and Higham NJ (2020), "Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. The Royal Society.
Abstract: Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a 4× -- 5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
BibTeX:
@article{Haidar2020,
  author = {Azzam Haidar and Harun Bayraktar and Stanimire Tomov and Jack Dongarra and Nicholas J. Higham},
  title = {Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020}
}
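The underlying scheme is classical iterative refinement: factorize once in low precision, then repeatedly solve for a correction using residuals computed in higher precision. Here is a minimal MATLAB sketch with single/double (rather than the FP16 Tensor Cores and GMRES-based refinement the paper uses); the test matrix is deliberately well conditioned and is my own choice.
rng('default');
n = 500;
A = randn(n) + n * eye(n);                  % well-conditioned test matrix
b = randn(n, 1);
[L, U, p] = lu(single(A), 'vector');        % one LU factorization in low precision
x = double(U \ (L \ single(b(p))));         % initial low-precision solve
for k = 1:5
    r = b - A * x;                          % residual computed in double precision
    d = double(U \ (L \ single(r(p))));     % correction from the low-precision factors
    x = x + d;
    fprintf('refinement step %d: residual norm %.2e\n', k, norm(b - A * x));
end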
Hamilton KE, Schuman CD, Young SR, Bennink RS, Imam N and Humble TS (2020), "Accelerating Scientific Computing in the Post-Moore's Era", ACM Transactions on Parallel Computing., 3, 2020. Vol. 7(1), pp. 1-31. Association for Computing Machinery (ACM).
Abstract: Novel uses of graphical processing units for accelerated computation revolutionized the field of high-performance scientific computing by providing specialized workflows tailored to algorithmic requirements. As the era of Moore's law draws to a close, many new non–von Neumann processors are emerging as potential computational accelerators, including those based on the principles of neuromorphic computing, tensor algebra, and quantum information. While development of these new processors is continuing to mature, the potential impact on accelerated computing is anticipated to be profound. We discuss how different processing models can advance computing in key scientific paradigms: machine learning and constraint satisfaction. Significantly, each of these new processor types utilizes a fundamentally different model of computation, and this raises questions about how to best use such processors in the design and implementation of applications. While many processors are being developed with a specific domain target, the ubiquity of spin-glass models and neural networks provides an avenue for multi-functional applications. This also hints at the infrastructure needed to integrate next-generation processing units into future high-performance computing systems.
BibTeX:
@article{Hamilton2020,
  author = {Kathleen E. Hamilton and Catherine D. Schuman and Steven R. Young and Ryan S. Bennink and Neena Imam and Travis S. Humble},
  title = {Accelerating Scientific Computing in the Post-Moore's Era},
  journal = {ACM Transactions on Parallel Computing},
  publisher = {Association for Computing Machinery (ACM)},
  year = {2020},
  volume = {7},
  number = {1},
  pages = {1--31},
  doi = {10.1145/3380940}
}
Han J, Rafique MM, Xu L, Butt AR, Lim S-H and Vazhkudai SS (2020), "MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems", In Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing., 5, 2020. IEEE.
Abstract: Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the multi-dimensional large-scale data associated with DL, especially atop extant multiple graphics processing units (GPUs) in modern supercomputers, poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures bring different performance trends in training throughput compared to the existing studies. Existing DL optimizations such as larger batch size and GPU locality-aware scheduling have little effect on improving DL training throughput performance due to fast CPU-to-GPU connections. Additionally, DL training on multiple GPUs scales sublinearly. Thus, simply adding more GPUs to a system is ineffective. To this end, we design MARBLE, a first-of-its-kind job scheduler, which considers the non-linear scalability of GPUs at the intra-node level to schedule an appropriate number of GPUs per node for a job. By sharing the GPU resources on a node with multiple DL jobs, MARBLE avoids low GPU utilization in current multi-GPU DL training on HPC systems. Our comprehensive evaluation on the Summit supercomputer shows that MARBLE is able to improve DL training performance by up to 48.3% compared to the popular Platform Load Sharing Facility (LSF) scheduler. Compared to the state-of-the-art DL scheduler, Optimus, MARBLE reduces the job completion time by up to 47%.
BibTeX:
@inproceedings{Han2020,
  author = {Jingoo Han and M. Mustafa Rafique and Luna Xu and Ali R. Butt and Seung-Hwan Lim and Sudharshan S. Vazhkudai},
  title = {MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems},
  booktitle = {Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ccgrid49817.2020.00-66}
}
Han Q, Yang H, Dun M, Luan Z, Gan L, Yang G and Qian D (2020), "Towards efficient tile low-rank GEMM computation on sunway many-core processors", The Journal of Supercomputing., 10, 2020. Springer Science and Business Media LLC.
Abstract: Tile low-rank general matrix multiplication (TLR GEMM) is a novel method of matrix multiplication on large data-sparse matrices, which can significantly reduce storage footprint and arithmetic complexity under a given accuracy. To implement high-performance TLR GEMM on the Sunway many-core processor, the following challenges remain to be addressed: 1) design an efficient parallel scheme; 2) provide an efficient kernel library of math functions commonly used in TLR GEMM. This paper proposes swTLR GEMM, an efficient implementation of TLR GEMM. We assign LR GEMM computation to a single computing processing element (CPE) and use a grouped task queue to process different data tiles of the TLR matrix. Moreover, we implement an efficient kernel library (swLR Kernels) for low-rank matrix operations. To scale to a massive number of core groups (CGs), we organize the CGs into a CG grid and partition the matrices into blocks accordingly. We also apply Cannon's algorithm to enable efficient communication when processing the matrix blocks across CGs simultaneously. The experimental results show that the DGEMM kernel in swLR Kernels achieves 102× speedup on average. In terms of overall performance, swTLR GEMM-LLD and swTLR GEMM-LLL achieve 91× and 20.1× speedup on average, respectively. In addition, our implementation of swTLR GEMM exhibits good scalability when running on 1,024 CGs of Sunway processors (66,560 cores in total).
BibTeX:
@article{Han2020a,
  author = {Qingchang Han and Hailong Yang and Ming Dun and Zhongzhi Luan and Lin Gan and Guangwen Yang and Depei Qian},
  title = {Towards efficient tile low-rank GEMM computation on sunway many-core processors},
  journal = {The Journal of Supercomputing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s11227-020-03444-2}
}
Hanford N, Pankajakshan R, Leon EA and Karlin I (2020), "Challenges of GPU-aware Communication in MPI", In Proceedings of the 2020 Workshop on Exascale MPI.
Abstract: GPUs are increasingly popular in HPC systems and applications. However, the communication bottleneck between GPUs, distributed across HPC nodes within a cluster, has limited achievable scalability of GPU-centric applications. Advances in inter-node GPU communication such as NVIDIA's GPUDirect have made great strides in addressing this issue. The added software development complexity has been addressed by simplified GPU programming paradigms such as Unified or Managed Memory. To understand the performance of these new features, new benchmarks were developed. Unfortunately, these benchmark efforts do not include correctness checking and certain messaging patterns used in applications. In this paper we highlight important gaps in communication benchmarks and motivate a methodology to help application developers understand the performance tradeoffs of different data movement options. Furthermore, we share systems tuning and deployment experiences across different GPU-aware MPI implementations. In particular, we demonstrate correctness testing is needed along with performance testing through modifications to an existing benchmark. In addition, we present a case study where existing benchmarks fail to characterize how data is moved within SW4, a seismic wave application, and create a benchmark to model this behavior. Finally, we motivate the need for an application-inspired benchmark methodology to assess system performance and guide application programmers on how to use the system more efficiently.
BibTeX:
@inproceedings{Hanford2020,
  author = {Nathan Hanford and Ramesh Pankajakshan and Edgar A. Leon and Ian Karlin},
  title = {Challenges of GPU-aware Communication in MPI},
  booktitle = {Proceedings of the 2020 Workshop on Exascale MPI},
  year = {2020}
}
Hanzely F, Doikov N, Richtárik P and Nesterov Y (2020), "Stochastic Subspace Cubic Newton Method", February, 2020.
Abstract: In this paper, we propose a new randomized second-order optimization algorithm---Stochastic Subspace Cubic Newton (SSCN)---for minimizing a high dimensional convex function f. Our method can be seen both as a stochastic extension of the cubically-regularized Newton method of Nesterov and Polyak (2006), and a second-order enhancement of stochastic subspace descent of Kozak et al. (2019). We prove that as we vary the minibatch size, the global convergence rate of SSCN interpolates between the rate of stochastic coordinate descent (CD) and the rate of cubic regularized Newton, thus giving new insights into the connection between first and second-order methods. Remarkably, the local convergence rate of SSCN matches the rate of stochastic subspace descent applied to the problem of minimizing the quadratic function (1/2)(x-x^*)^⊤ ∇²f(x^*)(x-x^*), where x^* is the minimizer of f, and hence depends on the properties of f at the optimum only. Our numerical experiments show that SSCN outperforms non-accelerated first-order CD algorithms while being competitive to their accelerated variants.
BibTeX:
@article{Hanzely2020,
  author = {Filip Hanzely and Nikita Doikov and Peter Richtárik and Yurii Nesterov},
  title = {Stochastic Subspace Cubic Newton Method},
  year = {2020}
}
Hanzely F (2020), "Optimization for Supervised Machine Learning: Randomized Algorithms for Data and Parameters", August, 2020.
Abstract: Many key problems in machine learning and data science are routinely modeled as optimization problems and solved via optimization algorithms. With the increase of the volume of data and the size and complexity of the statistical models used to formulate these often ill-conditioned optimization tasks, there is a need for new efficient algorithms able to cope with these challenges. In this thesis, we deal with each of these sources of difficulty in a different way. To efficiently address the big data issue, we develop new methods which in each iteration examine a small random subset of the training data only. To handle the big model issue, we develop methods which in each iteration update a random subset of the model parameters only. Finally, to deal with ill-conditioned problems, we devise methods that incorporate either higher-order information or Nesterov's acceleration/momentum. In all cases, randomness is viewed as a powerful algorithmic tool that we tune, both in theory and in experiments, to achieve the best results. Our algorithms have their primary application in training supervised machine learning models via regularized empirical risk minimization, which is the dominant paradigm for training such models. However, due to their generality, our methods can be applied in many other fields, including but not limited to data science, engineering, scientific computing, and statistics.
BibTeX:
@article{Hanzely2020a,
  author = {Filip Hanzely},
  title = {Optimization for Supervised Machine Learning: Randomized Algorithms for Data and Parameters},
  year = {2020},
  doi = {10.25781/KAUST-4F2DH}
}
Harris S, Chamberlain RD and Gill C (2020), "OpenCL Performance on the Intel Heterogeneous Architecture Research Platform", In Proceedings of the IEEE High-Performance Extreme Computing Conference.
Abstract: The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems. Frameworks such as OpenCL enable computation orchestration on existing systems, and its availability using the Intel High Level Synthesis compiler allows users to architect new designs for reconfigurable hardware using C/C++. Using the HARPv2 as a vehicle for exploration, we investigate the utility of several traditional matrix multiplication optimizations to better understand the performance portability of OpenCL and the implications for such optimizations on cache coherent heterogeneous architectures. Our results give targeted insights into the applicability of best practices that were designed for existing architectures when used on emerging heterogeneous systems.
BibTeX:
@inproceedings{Harris2020,
  author = {Steven Harris and Roger D. Chamberlain and Christopher Gill},
  title = {OpenCL Performance on the Intel Heterogeneous Architecture Research Platform},
  booktitle = {Proceedings of the IEEE High-Performance Extreme Computing Conference},
  year = {2020},
  url = {https://www.cse.wustl.edu/ roger/papers/hcg20.pdf}
}
He X, Pal S, Amarnath A, Feng S, Park D-H, Rovinski A, Ye H and Yu (2020), "Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices", In Proceedings of the International Conference on Supercomputing.
Abstract: While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach of first developing a packing technique to condense a sparse matrix and then propose a systolic array based system, Sparse-TPU, abbreviated to STPU, to accommodate the matrix computations for the packed denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08× higher performance while consuming 4.39× and 19.79× lower energy for integer (int8) and floating point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has 12.93% area overhead and an average of 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
BibTeX:
@inproceedings{He2020,
  author = {Xin He and Subhankar Pal and Aporva Amarnath and Siying Feng and Dong-Hyeon Park and Austin Rovinski and Haojie Ye and Yu},
  title = {Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices},
  booktitle = {Proceedings of the International Conference on Supercomputing},
  year = {2020},
  url = {https://web.eecs.umich.edu/ subh/publication/stpu-ics20/stpu-ics20.pdf}
}
He X, Wang X, Shi J and Liu Y (2020), "Testing high performance numerical simulation programs: experience, lessons learned, and open issues", In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis., 7, 2020. ACM.
Abstract: High performance numerical simulation programs are widely used to simulate actual physical processes on high performance computers for the analysis of various physical and engineering problems. They are usually regarded as non-testable due to their high complexity. This paper reports our real experience and lessons learned from testing five simulation programs that will be used to design and analyze nuclear power plants. We applied five testing approaches and found 33 bugs. We found that property-based testing and metamorphic testing are two effective methods. Nevertheless, we suffered from the lack of domain knowledge, the high test costs, the shortage of test cases, severe oracle issues, and inadequate automation support. Consequently, the five programs are not exhaustively tested from the perspective of software testing, and many existing software testing techniques and tools are not fully applicable due to scalability and portability issues. We need more collaboration and communication with other communities to promote the research and application of software testing techniques.
BibTeX:
@inproceedings{He2020a,
  author = {Xiao He and Xingwei Wang and Jia Shi and Yi Liu},
  title = {Testing high performance numerical simulation programs: experience, lessons learned, and open issues},
  booktitle = {Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3395363.3397382}
}
He G, Vialle S and Baboulin M (2020), "Parallelization of the k-means Algorithm in a Spectral Clustering Chain on CPU-GPU Platforms", Proceedings of the Conference on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms.
Abstract: k-means is a standard algorithm for clustering data. It generally constitutes the final step in a more complex chain of high-quality spectral clustering. However, this chain suffers from a lack of scalability when addressing large datasets. This can be overcome by also applying the k-means algorithm as a pre-processing task to reduce the number of input data instances. We describe parallel optimization techniques for the k-means algorithm on CPU and GPU. Experimental results on a synthetic dataset illustrate the numerical accuracy and performance of our implementations.
BibTeX:
@article{He2020b,
  author = {Guanlin He and Stéphane Vialle and Marc Baboulin},
  title = {Parallelization of the k-means Algorithm in a Spectral Clustering Chain on CPU-GPU Platforms},
  journal = {Proceedings of the Conference on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms},
  year = {2020},
  url = {https://www.lri.fr/ baboulin/heteropar2020.pdf}
}
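For reference, the kernel being parallelized is Lloyd's k-means iteration, which is a few lines of vectorized MATLAB. This sketch runs on tiny synthetic data and contains none of the CPU/GPU optimizations discussed in the paper.
rng('default');
X = [randn(500, 2); randn(500, 2) + 5];     % two well-separated synthetic clusters
kclust = 2;
C = X(randperm(size(X, 1), kclust), :);     % random initial centroids
for iter = 1:20
    D = sum(X.^2, 2) + sum(C.^2, 2)' - 2 * X * C';   % squared distances to centroids
    [~, labels] = min(D, [], 2);            % assignment step
    for j = 1:kclust
        if any(labels == j)
            C(j, :) = mean(X(labels == j, :), 1);    % update step
        end
    end
end
fprintf('Cluster sizes: %s\n', mat2str(accumarray(labels, 1)'));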
Hebling GM, Massignan JA, Junior JBL and Camillo MH (2020), "Sparse and numerically stable implementation of a distribution system state estimation based on Multifrontal QR factorization", Electric Power Systems Research., 12, 2020. Vol. 189, pp. 106734. Elsevier BV.
Abstract: Enhancing situational awareness of distribution networks is a requirement of Smart Grids. In order to fulfill this requirement, specialized algorithms have been developed to perform Distribution System State Estimation (DSSE). Due to the particularities of such networks, those algorithms often rely on simplifications and approximations of the measurement model which make it difficult to generalize their results. This paper presents a sparse and numerically stable implementation of an algorithm for DSSE, which does not require any additional assumption from the traditional state estimation formulation. The numerical stability is guaranteed by using Multifrontal QR factorization, and an optimal ordering technique is evaluated to reduce fill-in. Simulation results are carried out with IEEE three-phase unbalanced test feeders to evaluate the algorithm.
BibTeX:
@article{Hebling2020,
  author = {Gustavo M. Hebling and Julio A.D. Massignan and João B.A. London Junior and Marcos H.M. Camillo},
  title = {Sparse and numerically stable implementation of a distribution system state estimation based on Multifrontal QR factorization},
  journal = {Electric Power Systems Research},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {189},
  pages = {106734},
  doi = {10.1016/j.epsr.2020.106734}
}
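The numerically stable core here is solving the weighted least-squares measurement equations through a sparse QR factorization rather than the normal equations. Below is a toy MATLAB sketch of that step with a made-up sparse measurement Jacobian (not a distribution-network model); MATLAB's sparse qr is itself a multifrontal implementation (SuiteSparseQR).
rng('default');
m = 2000; nstate = 400;
H = sprandn(m, nstate, 0.01) + [speye(nstate); sparse(m - nstate, nstate)];  % sparse Jacobian
w = 1 + rand(m, 1);                         % measurement weights
z = H * randn(nstate, 1) + 1e-3 * randn(m, 1);   % noisy measurements
Hs = spdiags(sqrt(w), 0, m, m) * H;         % scale rows by the square roots of the weights
zs = sqrt(w) .* z;
[c, R] = qr(Hs, zs, 0);                     % economy-size sparse QR; Q is applied implicitly
xhat = R \ c;                               % back-substitution gives the WLS estimate
fprintf('Normal-equation residual norm: %.2e\n', norm(Hs' * (Hs * xhat - zs)));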
Herault T, Robert Y, Bosilca G, Harrison RJ, Lewis CA and Valeev EF (2020), "Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure". Thesis at: Institut de recherche en informatique de Toulouse (IRIT).
Abstract: Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
BibTeX:
@techreport{Herault2020,
  author = {Thomas Herault and Yves Robert and George Bosilca and Robert J. Harrison and Cannada A. Lewis and Edward F. Valeev},
  title = {Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure},
  school = {Institut de recherche en informatique de Toulouse (IRIT)},
  year = {2020},
  url = {https://hal.inria.fr/hal-02872813/document}
}
Herholz P and Sorkine-Hornung O (2020), "Sparse Cholesky Updates for Interactive Mesh Parameterization", ACM Transactions on Computer Graphics. Vol. 39(6)
Abstract: We present a novel linear solver for interactive parameterization tasks. Our method is based on the observation that quasi-conformal parameterizations of a triangle mesh are largely determined by boundary conditions. These boundary conditions are typically constructed interactively by users, who have to take several artistic and geometric constraints into account while introducing cuts on the geometry. Commonly, the main computational burden in these methods is solving a linear system every time new boundary conditions are imposed. The core of our solver is a novel approach to efficiently update the Cholesky factorization of the linear system to reflect new boundary conditions, thereby enabling a seamless and interactive workflow even for large meshes consisting of several millions of vertices.
BibTeX:
@article{Herholz2020,
  author = {Philipp Herholz and Olga Sorkine-Hornung},
  title = {Sparse Cholesky Updates for Interactive Mesh Parameterization},
  journal = {ACM Transactions on Computer Graphics},
  year = {2020},
  volume = {39},
  number = {6},
  url = {https://igl.ethz.ch/projects/sparse-cholesky-update/sparse-cholesky-update-paper.pdf}
}
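The core operation, updating a Cholesky factor after a low-rank change instead of refactorizing, can be illustrated with MATLAB's dense cholupdate. The paper's contribution is doing this sparsely and selectively for the entries touched by new boundary conditions, which this toy sketch does not attempt.
rng('default');
n = 200;
A = randn(n); A = A' * A + n * eye(n);      % symmetric positive definite system matrix
R = chol(A);                                % upper-triangular factor, A = R'*R
v = randn(n, 1);                            % stands in for a local change to the system
R1 = cholupdate(R, v);                      % factor of A + v*v' without refactorizing
relerr = norm(R1' * R1 - (A + v * v'), 'fro') / norm(A, 'fro');
fprintf('Relative error of the rank-1 updated factor: %.2e\n', relerr);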
Hermans B, Themelis A and Patrinos P (2020), "QPALM: A Proximal Augmented Lagrangian Method for Nonconvex Quadratic Programs", October, 2020.
Abstract: We propose QPALM, a nonconvex quadratic programming (QP) solver based on the proximal augmented Lagrangian method. This method solves a sequence of inner subproblems which can be enforced to be strongly convex and which therefore admit of a unique solution. The resulting steps are shown to be equivalent to inexact proximal point iterations on the extended-real-valued cost function. Furthermore, we prove global convergence of such iterations to a stationary point at an R-linear rate in the specific case of a (possibly nonconvex) QP. The QPALM algorithm solves the subproblems iteratively using semismooth Newton directions and an exact linesearch. The former can be computed efficiently in most iterations by making use of suitable factorization update routines, while the latter requires the zero of a monotone, piecewise affine function. QPALM is implemented in open-source C code, with tailored linear algebra routines for the factorization in a self-written package LADEL. The resulting implementation is shown to be extremely robust in numerical simulations, solving all of the Maros-Meszaros problems and finding a stationary point for most of the nonconvex QPs in the Cutest test set. Furthermore, it is shown to be competitive against state-of-the-art convex QP solvers in typical QPs arising from application domains such as portfolio optimization and model predictive control. As such, QPALM strikes a unique balance between solving both easy and hard problems efficiently.
BibTeX:
@article{Hermans2020,
  author = {Ben Hermans and Andreas Themelis and Panagiotis Patrinos},
  title = {QPALM: A Proximal Augmented Lagrangian Method for Nonconvex Quadratic Programs},
  year = {2020}
}
Hey T, Butler K, Jackson S and Thiyagalingam J (2020), "Machine learning and big scientific data", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190054. The Royal Society.
Abstract: This paper reviews some of the challenges posed by the huge growth of experimental data generated by the new generation of large-scale experiments at UK national facilities at the Rutherford Appleton Laboratory (RAL) site at Harwell near Oxford. Such ‘Big Scientific Data' comes from the Diamond Light Source and Electron Microscopy Facilities, the ISIS Neutron and Muon Facility and the UK's Central Laser Facility. Increasingly, scientists are now required to use advanced machine learning and other AI technologies both to automate parts of the data pipeline and to help find new scientific discoveries in the analysis of their data. For commercially important applications, such as object recognition, natural language processing and automatic translation, deep learning has made dramatic breakthroughs. Google's DeepMind has now used the deep learning technology to develop their AlphaFold tool to make predictions for protein folding. Remarkably, it has been able to achieve some spectacular results for this specific scientific problem. Can deep learning be similarly transformative for other scientific problems? After a brief review of some initial applications of machine learning at the RAL, we focus on challenges and opportunities for AI in advancing materials science. Finally, we discuss the importance of developing some realistic machine learning benchmarks using Big Scientific Data coming from several different scientific domains. We conclude with some initial examples of our ‘scientific machine learning' benchmark suite and of the research challenges these benchmarks will enable. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science'.
BibTeX:
@article{Hey2020,
  author = {Tony Hey and Keith Butler and Sam Jackson and Jeyarajan Thiyagalingam},
  title = {Machine learning and big scientific data},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190054},
  doi = {10.1098/rsta.2019.0054}
}
Higham NJ and Mary T (2020), "Sharper Probabilistic Backward Error Analysis for Basic Linear Algebra Kernels with Random Data"
Abstract: Standard backward error analyses for numerical linear algebra algorithms provide worst-case bounds that can significantly overestimate the backward error. Our recent probabilistic error analysis, which assumes rounding errors to be independent random variables [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], contains smaller constants but its bounds can still be pessimistic. We perform a new probabilistic error analysis that assumes both the data and the rounding errors to be random variables and assumes only mean independence. We prove that for data with zero or small mean we can relax the existing probabilistic bounds of order √n u to much sharper bounds of order u, which are independent of n. Our fundamental result is for summation and we use it to derive results for inner products, matrix–vector products, and matrix–matrix products. The analysis answers the open question of why random data distributed on [-1, 1] leads to smaller error growth for these kernels than random data distributed on [0, 1]. We also propose a new algorithm for multiplying two matrices that transforms the rows of the first matrix to have zero mean and we show that it can achieve significantly more accurate results than standard matrix multiplication.
BibTeX:
@article{Higham2020,
  author = {Higham, Nicholas J. and Mary, Theo},
  title = {Sharper Probabilistic Backward Error Analysis for Basic Linear Algebra Kernels with Random Data},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2743/1/paper.pdf}
}
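A quick MATLAB experiment in the spirit of the question this paper answers, using nothing beyond base MATLAB: measure a normalized error of single-precision inner products for data drawn from [0, 1] (mean 1/2) against data drawn from [-1, 1] (zero mean) as the vector length grows. This is only a qualitative illustration of the effect, not the paper's analysis.
rng('default');
ns = round(logspace(3, 6, 7));
err01 = zeros(size(ns)); err11 = zeros(size(ns));
for i = 1:numel(ns)
    n = ns(i);
    x = rand(n, 1);  y = rand(n, 1);               % data on [0, 1], mean 1/2
    a = 2*rand(n, 1) - 1; b = 2*rand(n, 1) - 1;    % data on [-1, 1], zero mean
    err01(i) = abs(double(dot(single(x), single(y))) - dot(x, y)) / (abs(x)'*abs(y));
    err11(i) = abs(double(dot(single(a), single(b))) - dot(a, b)) / (abs(a)'*abs(b));
end
figure;
loglog(ns, err01, '-o', ns, err11, '-s');
legend('data on [0, 1]', 'data on [-1, 1]', 'Location', 'northwest');
xlabel('n'); ylabel('normalized inner product error');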
Higham DJ, Higham NJ and Pranesh S (2020), "Random Matrices Generating Large Growth in LU Factorization with Pivoting"
Abstract: We identify a class of random, dense n × n matrices for which LU factorization with any form of pivoting produces a growth factor of at least n/(4 log n) for large n with high probability. The condition number of the matrices can be arbitrarily chosen and large growth also happens for the transpose. No previous matrices with all these properties were known. The matrices can be generated by the MATLAB function gallery('randsvd',..), and they are formed as the product of two random orthogonal matrices from the Haar distribution with a diagonal matrix having only one diagonal entry different from 1, which lies between 0 and 1 (the "one small singular value" case). Our explanation for the large growth uses the fact that the maximum absolute value of any element of a Haar distributed orthogonal matrix tends to be relatively small for large n. We verify the behavior numerically, finding that for partial pivoting the actual growth is significantly larger than the lower bound, and much larger than the growth observed for random matrices with elements from the uniform [0, 1] or standard normal distributions. We show more generally that a rank-1 perturbation to an orthogonal matrix producing large growth for any form of pivoting also generates large growth under reasonable assumptions. Finally, we demonstrate that GMRES-based iterative refinement can provide stable solutions to Ax = b when large growth occurs in low precision LU factors, even when standard iterative refinement cannot.
BibTeX:
@article{Higham2020a,
  author = {Higham, Desmond J. and Higham, Nicholas J. and Pranesh, Srikara},
  title = {Random Matrices Generating Large Growth in LU Factorization with Pivoting},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2764/1/paper.pdf}
}
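The matrices in question can be generated directly in MATLAB, so a small sanity check of the claimed behaviour costs only a few lines. Note that max|U|/max|A| is just a lower bound on the growth factor, and the numbers will vary with the random state.
rng('default');
n = 1000;
A = gallery('randsvd', n, 1e8, 2);          % mode 2: one small singular value, kappa = 1e8
[~, U] = lu(A);                             % LU with partial pivoting
growth = max(abs(U(:))) / max(abs(A(:)));   % lower bound on the growth factor
fprintf('n = %d: growth >= %.1f, n/(4*log(n)) = %.1f\n', n, growth, n/(4*log(n)));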
Higham NJ and Liu X (2020), "A Multiprecision Derivative-Free Schur-Parlett Algorithm for Computing Matrix Functions"
Abstract: The Schur--Parlett algorithm, implemented in MATLAB as funm, computes a function f(A) of an n × n matrix A by using the Schur decomposition and a block recurrence of Parlett. The algorithm requires the ability to compute f and its derivatives, and it requires that f has a Taylor series expansion with a suitably large radius of convergence. We develop a version of the Schur--Parlett algorithm that requires only function values and uses higher precision arithmetic to evaluate f on the diagonal blocks of order greater than 2 (if there are any) of the reordered and blocked Schur form. The key idea is to compute by diagonalization the function of a small random diagonal perturbation of each triangular block, where the perturbation ensures that diagonalization will succeed. This multiprecision Schur--Parlett algorithm is applicable to arbitrary functions f and, like the original Schur--Parlett algorithm, it generally behaves in a numerically stable fashion. Our algorithm is inspired by Davies's randomized approximate diagonalization method, but we explain why that is not a reliable numerical method for computing matrix functions. We apply our algorithm to the matrix Mittag--Leffler function and show that it yields results of accuracy similar to, and in some cases much greater than, the state of the art algorithm for this function. The algorithm will be useful for evaluating any matrix function for which the derivatives of the underlying function are not readily available or accurately computable.
BibTeX:
@article{Higham2020b,
  author = {Nicholas J. Higham and Xiaobo Liu},
  title = {A Multiprecision Derivative-Free Schur-Parlett Algorithm for Computing Matrix Functions},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2781/1/paper.pdf}
}
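A toy MATLAB version of the randomized-approximate-diagonalization idea the paper starts from: perturb a triangular block by a small random diagonal, diagonalize, and evaluate f on the eigenvalues. This is only the naive scheme; the paper explains why it is not reliable as is and replaces it with higher-precision evaluation on the reordered Schur blocks.
rng('default');
n = 8;
T = triu(randn(n)) + 5*eye(n);          % a triangular test block
f = @exp;                               % any scalar function of interest
delta = 1e-8;
E = delta * diag(randn(n, 1));          % small random diagonal perturbation
[V, D] = eig(T + E);                    % diagonalization now succeeds generically
F = V * diag(f(diag(D))) / V;           % f(T) approximated from the eigendecomposition
relerr = norm(F - expm(T), 1) / norm(expm(T), 1)   % reference comparison for f = exp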
Hrga T and Povh J (2020), "MADAM: A parallel exact solver for Max-Cut based on semidefinite programming and ADMM", October, 2020.
Abstract: We present MADAM, a parallel semidefinite-programming-based exact solver for Max-Cut, the problem of finding the cut with maximum weight in a given graph. The algorithm uses a branch-and-bound paradigm that applies the alternating direction method of multipliers as the bounding routine to solve the basic semidefinite relaxation strengthened by a subset of hypermetric inequalities. The benefit of the new approach is a less computationally expensive update rule for the dual variable with respect to the inequality constraints. We provide a proof of convergence of the algorithm, as well as extensive computational experiments with this method, to show that our algorithm outperforms current state-of-the-art approaches. Furthermore, by combining algorithmic ingredients from the serial algorithm we develop an efficient distributed parallel solver based on MPI.
BibTeX:
@article{Hrga2020,
  author = {Timotej Hrga and Janez Povh},
  title = {MADAM: A parallel exact solver for Max-Cut based on semidefinite programming and ADMM},
  year = {2020}
}
Hribar R, Hrga T, Papa G, Petelin G, Povh J, Pržulj N and Vukašinović V (2020), "Four algorithms to solve symmetric multi-type non-negative matrix tri-factorization problem", December, 2020.
Abstract: In this paper, we consider the symmetric multi-type non-negative matrix tri-factorization problem (SNMTF), which attempts to factorize several symmetric non-negative matrices simultaneously. This can be considered as a generalization of the classical non-negative matrix tri-factorization problem; it includes a non-convex objective function, which is a multivariate sixth-degree polynomial, and has a convex feasibility set. It has a special importance in data science, since it serves as a mathematical model for the fusion of different data sources in data clustering. We develop four methods to solve the SNMTF. They are based on four theoretical approaches known from the literature: the fixed point method (FPM), the block-coordinate descent with projected gradient (BCD), the gradient method with exact line search (GM-ELS) and the adaptive moment estimation method (ADAM). For each of these methods we offer a software implementation: for the former two methods we use Matlab and for the latter two Python with the TensorFlow library. We test these methods on three data-sets: one is a synthetic data-set that we generated, while the other two represent real-life similarities between different objects. Extensive numerical results show that with sufficient computing time all four methods perform satisfactorily and ADAM most often yields the best mean square error (MSE). However, if the computation time is limited, FPM gives the best MSE because it shows the fastest convergence at the beginning. All data-sets and codes are publicly available on our GitLab profile.
BibTeX:
@article{Hribar2020,
  author = {Rok Hribar and Timotej Hrga and Gregor Papa and Gašper Petelin and Janez Povh and Nataša Pržulj and Vida Vukašinović},
  title = {Four algorithms to solve symmetric multi-type non-negative matrix tri-factorization problem},
  year = {2020}
}
Hu J, Berger-Vergiat L, Thomas S, Swirydowicz K, Yamazaki I, Mullowney P, Ananthan S, Rajamanickam S, Sitaraman J and Sprague MA (2020), "Compare linear-system solver and preconditioner stacks with emphasis on GPU performance and propose phase-2 NGP solver development pathway". Thesis at: Office of Advanced Scientific Computing Research, Office of Science, US Department of Energy.
Abstract: The goal of the ExaWind project is to enable predictive simulations of wind farms comprised of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary physics codes in the ExaWind project are Nalu-Wind, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, and OpenFAST, which is a whole-turbine simulation code. The Nalu-Wind model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs as reinitialization of matrices and recomputation of preconditioners is required at every time step. In this report we evaluated GPU-performance baselines for the linear solvers in the Trilinos and hypre solver stacks using two representative Nalu-Wind simulations: an atmospheric boundary layer precursor simulation on a structured mesh, and a fixed-wing simulation using unstructured overset meshes. Both strong-scaling and weak-scaling experiments were conducted on the OLCF supercomputer Summit and similar proxy clusters. We focused on the performance of multi-threaded Gauss-Seidel and two-stage Gauss-Seidel that are extensions of classical Gauss-Seidel; of one-reduce GMRES, a communication-reducing variant of the Krylov GMRES; and algebraic multigrid methods that incorporate the aforementioned methods. The team has established that AMG methods are capable of solving linear systems arising from the fixed-wing overset meshes on CPU, a critical intermediate result for ExaWind FY20 Q3 and Q4 milestones. For the fixed-wing strong-scaling study (model with 3M grid-points), the team identified that Nalu-Wind simulations with the new Trilinos and hypre solvers scale to modest GPU counts, maintaining above 70% efficiency up to 6 GPUs. However, there still remain significant bottlenecks to performance: matrix assembly (hypre) and AMG setup (hypre and Trilinos). In the weak-scaling experiments (going from 0.4M to 211M gridpoints), it is shown that the solver apply phases are faster on GPUs, but that Nalu-Wind simulation times grow, primarily due to the multigrid-setup process. Finally, based on the report outcomes, we propose a linear solver path-forward for the remainder of the ExaWind project. Near term, the NREL team will continue their work on GPU-based linear-system assembly. They will also investigate how the use of alternatives to the NVIDIA UVM (unified virtual memory) paradigm affects performance. Longer term, the NREL team will evaluate algorithmic performance on other types of accelerators and merge their improvements back to the main hypre repository branch. Near term, the Trilinos team will address performance bottlenecks identified in this milestone, such as implementing a GPU-based segregated momentum solve and reusing matrix graphs across linear-system assembly phases. Longer term, the Trilinos team will do detailed analysis and optimization of multigrid setup.
BibTeX:
@techreport{Hu2020,
  author = {Jonathan Hu and Luc Berger-Vergiat and Stephen Thomas and Kasia Swirydowicz and Ichitaro Yamazaki and Paul Mullowney and Shreyas Ananthan and Sivasankaran Rajamanickam and Jay Sitaraman and Michael A. Sprague},
  title = {Compare linear-system solver and preconditioner stacks with emphasis on GPU performance and propose phase-2 NGP solver development pathway},
  school = {Office of Advanced Scientific Computing Research, Office of Science, US Department of Energy},
  year = {2020},
  url = {https://www.osti.gov/servlets/purl/1630801}
}
Hu X, Wu K and Zikatanov LT (2020), "A Posteriori Error Estimates for Multilevel Methods for Graph Laplacians", July, 2020.
Abstract: In this paper, we study a posteriori error estimators which aid multilevel iterative solvers for linear systems of graph Laplacians. In earlier works such estimates were computed by solving a perturbed global optimization problem, which could be computationally expensive. We propose a novel strategy to compute these estimates by constructing a Helmholtz decomposition on the graph based on a spanning tree and the corresponding cycle space. To compute the error estimator, we efficiently solve a linear system on the spanning tree and then a least-squares problem on the cycle space. As we show, such an estimator has a nearly-linear computational complexity for sparse graphs under certain assumptions. Numerical experiments are presented to demonstrate the efficacy of the proposed method.
BibTeX:
@article{Hu2020a,
  author = {Xiaozhe Hu and Kaiyi Wu and Ludmil T. Zikatanov},
  title = {A Posteriori Error Estimates for Multilevel Methods for Graph Laplacians},
  year = {2020}
}
Hu Y, Xu M, Kuang Y and Durand F (2020), "AsyncTaichi: Whole-Program Optimizations for Megakernel Sparse Computation and Differentiable Programming", December, 2020.
Abstract: We present a whole-program optimization framework for the Taichi programming language. As an imperative language tailored for sparse and differentiable computation, Taichi's unique computational patterns lead to attractive optimization opportunities that are not present in other compiler or runtime systems. For example, to support iteration over sparse voxel grids, excessive list generation tasks are often inserted. By analyzing sparse computation programs at a higher level, our optimizer is able to remove the majority of unnecessary list generation tasks. To provide maximum programming flexibility, our optimization system conducts on-the-fly optimization of the whole computational graph consisting of Taichi kernels. The optimized Taichi kernels are then just-in-time compiled in parallel, and dispatched to parallel devices such as multithreaded CPUs and massively parallel GPUs. Without any code modification on Taichi programs, our new system leads to 3.07 - 3.90× fewer kernel launches and 1.73 - 2.76× speedups on our benchmarks including sparse-grid physical simulation and differentiable programming.
BibTeX:
@article{Hu2020b,
  author = {Yuanming Hu and Mingkuan Xu and Ye Kuang and Frédo Durand},
  title = {AsyncTaichi: Whole-Program Optimizations for Megakernel Sparse Computation and Differentiable Programming},
  year = {2020}
}
Huang X, Liang X, Liu Z, Yu Y and Li L (2020), "SPAN: A Stochastic Projected Approximate Newton Method", February, 2020.
Abstract: Second-order optimization methods have desirable convergence properties. However, the exact Newton method requires expensive computation for the Hessian and its inverse. In this paper, we propose SPAN, a novel approximate and fast Newton method. SPAN computes the inverse of the Hessian matrix via low-rank approximation and stochastic Hessian-vector products. Our experiments on multiple benchmark datasets demonstrate that SPAN outperforms existing first-order and second-order optimization methods in terms of the convergence wall-clock time. Furthermore, we provide a theoretical analysis of the per-iteration complexity, the approximation error, and the convergence rate. Both the theoretical analysis and experimental results show that our proposed method achieves a better trade-off between the convergence rate and the per-iteration efficiency.
BibTeX:
@article{Huang2020,
  author = {Xunpeng Huang and Xianfeng Liang and Zhengyang Liu and Yue Yu and Lei Li},
  title = {SPAN: A Stochastic Projected Approximate Newton Method},
  year = {2020}
}
Huang K, Zhang J and Zhang S (2020), "Cubic Regularized Newton Method for Saddle Point Models: a Global and Local Convergence Analysis", August, 2020.
Abstract: In this paper, we propose a cubic regularized Newton (CRN) method for solving convex-concave saddle point problems (SPP). At each iteration, a cubic regularized saddle point subproblem is constructed and solved, which provides a search direction for the iterate. With properly chosen stepsizes, the method is shown to converge to the saddle point with global linear and local superlinear convergence rates, if the saddle point function is gradient Lipschitz and strongly-convex-strongly-concave. In the case that the function is merely convex-concave, we propose a homotopy continuation (or path-following) method. Under a Lipschitz-type error bound condition, we present an iteration complexity bound of O(ln(1/𝜖)) to reach an 𝜖-solution through a homotopy continuation approach, and the iteration complexity bound becomes O((1/𝜖)^((1−θ)/2)) under a Hölderian-type error bound condition involving a parameter θ ∈ (0, 1).
BibTeX:
@article{Huang2020a,
  author = {Kevin Huang and Junyu Zhang and Shuzhong Zhang},
  title = {Cubic Regularized Newton Method for Saddle Point Models: a Global and Local Convergence Analysis},
  year = {2020}
}
Huo Z, Mei G, Casolla G and Giampaolo F (2020), "Designing an efficient parallel spectral clustering algorithm on multi-core processors in Julia", Journal of Parallel and Distributed Computing., 4, 2020. Vol. 138, pp. 211-221. Elsevier BV.
Abstract: Spectral clustering is widely used in data mining, machine learning and other fields. It can identify the arbitrary shape of a sample space and converge to the global optimal solution. Compared with the traditional k-means algorithm, the spectral clustering algorithm has stronger adaptability to data and better clustering results. However, the computation of the algorithm is quite expensive. In this paper, an efficient parallel spectral clustering algorithm on multi-core processors in the Julia language is proposed, and we refer to it as juPSC. The Julia language is a high-performance, open-source programming language. The juPSC is composed of three procedures: (1) calculating the affinity matrix, (2) calculating the eigenvectors, and (3) conducting k-means clustering. Procedures (1) and (3) are computed by the efficient parallel algorithm, and the COO format is used to compress the affinity matrix. Two groups of experiments are conducted to verify the accuracy and efficiency of the juPSC. Experimental results indicate that (1) the juPSC achieves speedups of approximately 14×–18× on a 24-core CPU and that (2) the serial version of the juPSC is faster than the Python version of scikit-learn. Moreover, the structure and functions of the juPSC are designed considering modularity, which is convenient for combination and further optimization with other parallel computing platforms.
BibTeX:
@article{Huo2020,
  author = {Zenan Huo and Gang Mei and Giampaolo Casolla and Fabio Giampaolo},
  title = {Designing an efficient parallel spectral clustering algorithm on multi-core processors in Julia},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {138},
  pages = {211--221},
  doi = {10.1016/j.jpdc.2020.01.003}
}
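A serial MATLAB sketch of the three-stage pipeline described above (affinity matrix, eigenvectors, k-means); juPSC itself is a parallel Julia implementation with a COO-compressed affinity matrix, so this only shows the mathematical skeleton. X is an n-by-d data matrix and k the number of clusters, both assumed given; the kernel width sigma and the number of k-means replicates are arbitrary choices, and kmeans requires the Statistics and Machine Learning Toolbox.
sigma = 1.0;                                           % kernel width (problem dependent)
D2 = max(sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X'), 0);  % pairwise squared distances
W = exp(-D2 / (2*sigma^2));                            % Gaussian affinity matrix
W(1:size(W,1)+1:end) = 0;                              % zero the diagonal
d = sum(W, 2);
S = diag(d.^-0.5) * W * diag(d.^-0.5);                 % symmetrically normalized affinity
[V, ~] = eigs(S, k, 'largestreal');                    % leading eigenvectors
V = V ./ vecnorm(V, 2, 2);                             % row-normalize the spectral embedding
labels = kmeans(V, k, 'Replicates', 10);               % final clustering stage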
Hussain MT, Selvitopi O, Buluç A and Azad A (2020), "Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale", October, 2020.
Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating trillions of nonzeros in the output matrix. Distributed SpGEMM at this extreme scale faces two key challenges: (1) high communication cost and (2) inadequate memory to generate the output. We address these challenges with an integrated communication-avoiding and memory-constrained SpGEMM algorithm that scales to 262,144 cores (more than 1 million hardware threads) and can multiply sparse matrices of any size as long as inputs and a fraction of output fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when multiplying large-scale protein-similarity matrices.
BibTeX:
@article{Hussain2020,
  author = {Md Taufique Hussain and Oguz Selvitopi and Aydin Buluç and Ariful Azad},
  title = {Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale},
  year = {2020}
}
Iakymchuk R, Barreda M, Graillat S, Aliaga J and Quintana-Ortí E (2020), "Reproducibility of Parallel Preconditioned Conjugate Gradient in Hybrid Programming Environments"
Abstract: The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility/non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver a reproducible number of iterations, residuals, direct errors, and vector-solutions, with an overhead of less than 37.7% on 768 cores.
BibTeX:
@article{Iakymchuk2020,
  author = {Roman Iakymchuk and Maria Barreda and Stef Graillat and José Aliaga and Enrique Quintana-Ortí},
  title = {Reproducibility of Parallel Preconditioned Conjugate Gradient in Hybrid Programming Environments},
  year = {2020},
  url = {https://hal.archives-ouvertes.fr/hal-02427795}
}
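Why reproducibility is an issue at all: floating-point addition is not associative, so different reduction orders (as arise, for instance, from different MPI or OpenMP schedules) give slightly different dot products. A two-line MATLAB illustration of the effect the paper sets out to eliminate:
rng('default');
x = randn(1e6, 1); y = randn(1e6, 1);
s1 = sum(x .* y);                    % one summation order
p = randperm(numel(x));
s2 = sum(x(p) .* y(p));              % same data, another order
fprintf('difference between the two summation orders: %.3e\n', abs(s1 - s2));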
Inguva SC and Seventiline JB (2020), "Implementation of FPGA Design of FFT Architecture based on CORDIC Algorithm", International Journal of Electronics., December, 2020. Informa UK Limited.
Abstract: The coordinate rotation digital computer (CORDIC) is a class of shift-add algorithms for rotating vectors in a plane. Trigonometric look-up techniques for computing digital waveforms require a large amount of memory, whereas the CORDIC algorithm is flexible and achieves high quantization accuracy. Its main drawbacks are the linear rate of convergence, which ties accuracy to the number of iterations, and the overall performance cost of the many shift-add operations and the associated power consumption. The improved CORDIC algorithm proposed here uses an integrated adder-subtractor in place of a binary adder-subtractor to decrease the iteration count and to reduce hardware. It splits the rotation angle into a series of micro-rotation angles, and the new set of angles provides fast convergence. A canonical signed-digit (CSD) approach, together with the Hcub algorithm, is employed to reduce the number of adder-subtractors and shifters in the CORDIC architecture. The performance of the proposed CORDIC design is verified by employing it in an FFT implementation. Simulation results indicate frequency improvements of 77.20%, 82.78%, 78.30% and 76.57% over conventional methods, and a comparison of the resulting FFT with conventional methods also favours the proposed CORDIC design. Power consumption, the number of iterations and the hardware complexity are reduced by the improved CORDIC, and the proposed algorithm is evaluated through an FPGA implementation.
BibTeX:
@article{Inguva2020,
  author = {Sharath Chandra Inguva and J. B. Seventiline},
  title = {Implementation of FPGA Design of FFT Architecture based on CORDIC Algorithm},
  journal = {International Journal of Electronics},
  publisher = {Informa UK Limited},
  year = {2020},
  doi = {10.1080/00207217.2020.1870750}
}
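To make the shift-add structure discussed above concrete, here is a plain (unimproved) CORDIC rotation-mode iteration for sine and cosine in MATLAB. The paper's contribution is a modified adder/subtractor organisation and CSD/Hcub optimisation of the hardware, which a floating-point sketch cannot capture; the angle and iteration count below are arbitrary.
theta = 0.6;                                        % target angle in radians, |theta| < ~1.74
N = 30;                                             % number of micro-rotations
angles = atan(2.^(-(0:N-1)));                       % precomputed arctangent table
K = prod(1 ./ sqrt(1 + 2.^(-2*(0:N-1))));           % scaling (gain) constant
x = 1; y = 0; z = theta;
for i = 0:N-1
    d = sign(z); if d == 0, d = 1; end
    xNew = x - d * y * 2^(-i);                      % in hardware: shift by i, then add/subtract
    y    = y + d * x * 2^(-i);
    x    = xNew;
    z    = z - d * angles(i+1);                     % drive the residual angle to zero
end
cosTheta = K * x;  sinTheta = K * y;                % compare with cos(0.6) and sin(0.6)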
Isotton G, Janna C and Bernaschi M (2020), "A GPU-accelerated adaptive FSAI preconditioner for massively parallel simulations", October, 2020.
Abstract: The solution of linear systems of equations is a central task in a number of scientific and engineering applications. In many cases the solution of linear systems may take most of the simulation time, thus representing a major bottleneck in the further development of scientific and technical software. For large-scale simulations, nowadays accounting for several millions or even billions of unknowns, it is quite common to resort to preconditioned iterative solvers for exploiting their low memory requirements and their (at least potential) parallelism. Approximate inverses have been shown to be robust and effective preconditioners in various contexts. In this work, we show how adaptive FSAI, an approximate inverse characterized by a very high degree of parallelism, can be successfully implemented on a distributed memory computer equipped with GPU accelerators. Taking advantage of GPUs in the adaptive FSAI set-up is not a trivial task; nevertheless we show through extensive numerical experimentation how the proposed approach outperforms more traditional preconditioners and results in close-to-ideal behaviour in challenging linear algebra problems.
BibTeX:
@article{Isotton2020,
  author = {Giovanni Isotton and Carlo Janna and Massimo Bernaschi},
  title = {A GPU-accelerated adaptive FSAI preconditioner for massively parallel simulations},
  year = {2020}
}
Iwashita T, Suzuki K and Fukaya T (2020), "An Integer Arithmetic-Based Sparse Linear Solver Using a GMRES Method and Iterative Refinement", September, 2020.
Abstract: In this paper, we develop a (preconditioned) GMRES solver based on integer arithmetic, and introduce an iterative refinement framework for the solver. We describe the data format for the coefficient matrix and vectors for the solver that is based on integer or fixed-point numbers. To avoid overflow in calculations, we introduce initial scaling and logical shifts (adjustments) of operands in arithmetic operations. We present the approach for operand shifts, considering the characteristics of the GMRES algorithm. Numerical tests demonstrate that the integer arithmetic-based solver with iterative refinement has solver performance comparable, in terms of convergence, to the standard solver based on floating-point arithmetic. Moreover, we show that preconditioning is important, not only for improving convergence but also for reducing the risk of overflow.
BibTeX:
@article{Iwashita2020,
  author = {Takeshi Iwashita and Kengo Suzuki and Takeshi Fukaya},
  title = {An Integer Arithmetic-Based Sparse Linear Solver Using a GMRES Method and Iterative Refinement},
  year = {2020}
}
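The iterative refinement framework is easy to show in floating point: solve in low precision, compute the residual in higher precision, correct, repeat. The MATLAB sketch below uses single-precision LU factors in place of the paper's integer/fixed-point GMRES machinery, so it only illustrates the outer refinement loop, with a random test system generated on the spot.
rng('default');
n = 500;
A = randn(n); b = randn(n, 1);
As = single(A);
[Ls, Us, ps] = lu(As, 'vector');                 % factorize once in low (single) precision
x = zeros(n, 1);
for it = 1:5
    r = b - A*x;                                 % residual in the working (double) precision
    dx = double(Us \ (Ls \ single(r(ps))));      % correction from the low-precision factors
    x = x + dx;
    fprintf('refinement step %d: relative residual %.2e\n', it, norm(b - A*x)/norm(b));
end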
Jackson A, Thouvenin P-A, Jiang M, Abdulaziz A, Dabbech A and Wiaux Y (2020), "Scaling convex optimisation for radio astronomy", In Proceedings of the ISC High Performance Computing., December, 2020.
Abstract: Aperture synthesis by interferometry in radio astronomy allows observation of the sky by antenna arrays with otherwise inaccessible angular resolutions and sensitivities, providing a whole wealth of information for astrophysics and cosmology. At the target resolution and dynamic range of interest to upcoming telescopes, image cubes will reach close to Petabyte sizes, with data volumes orders of magnitude larger, possibly verging on the Exabyte scale. In this context, convex optimisation theory offers modern algorithmic structures that can handle extreme data volumes by distributing the computation across large computer systems, while explicitly incorporating complex image prior models to regularise the problem. Motivated by these advantages, we introduce a C++ library and set of applications, dubbed Puri-Psi, as a production implementation of the recently proposed RI imaging approach HyperSARA, previously available as a proof-of-concept MATLAB implementation. Puri-Psi enables scaling up to very large data volumes by parallelising across a large number of processes and compute nodes. We validate Puri-Psi against the MATLAB implementation, and evaluate its performance in a representative high performance computing setting.
BibTeX:
@inproceedings{Jackson2020,
  author = {Adrian Jackson and Pierre-Antoine Thouvenin and Ming Jiang and Abdullah Abdulaziz and Arwa Dabbech and Yves Wiaux},
  title = {Scaling convex optimisation for radio astronomy},
  booktitle = {Proceedings of the ISC High Performance Computing},
  year = {2020},
  url = {https://researchportal.hw.ac.uk/files/44074559/Scaling_convex_optimisation_for_radio_astronomy.pdf}
}
Jakovetic D, Krejic N, Jerinkic NK, Malaspina G and Micheletti A (2020), "Distributed Fixed Point Method for Solving Systems of Linear Algebraic Equations", January, 2020.
Abstract: We present a class of iterative fully distributed fixed point methods to solve a system of linear equations, such that each agent in the network holds one of the equations of the system. Under a generic directed, strongly connected network, we prove a convergence result analogous to the one for fixed point methods in the classical, centralized, framework: the proposed method converges to the solution of the system of linear equations at a linear rate. We further explicitly quantify the rate in terms of the linear system and the network parameters. Next, we show that the algorithm provably works under time-varying directed networks provided that the underlying graph is connected over bounded iteration intervals, and we establish a linear convergence rate for this setting as well. A set of numerical results is presented, demonstrating practical benefits of the method over existing alternatives.
BibTeX:
@article{Jakovetic2020,
  author = {Dusan Jakovetic and Natasa Krejic and Natasa Krklec Jerinkic and Greta Malaspina and Alessandra Micheletti},
  title = {Distributed Fixed Point Method for Solving Systems of Linear Algebraic Equations},
  year = {2020}
}
Jakovetic D, Bajovic D, Xavier J and Moura JMF (2020), "Primal-Dual Methods for Large-Scale and Distributed Convex Optimization and Data Analytics", Proceedings of the IEEE. , pp. 1-16. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: The augmented Lagrangian method (ALM) is a classical optimization tool that solves a given "difficult" (constrained) problem via finding solutions of a sequence of "easier" (often unconstrained) subproblems with respect to the original (primal) variable, wherein constraints satisfaction is controlled via the so-called dual variables. ALM is highly flexible with respect to how primal subproblems can be solved, giving rise to a plethora of different primal-dual methods. The powerful ALM mechanism has recently proved to be very successful in various large-scale and distributed applications. In addition, several significant advances have appeared, primarily on precise complexity results with respect to computational and communication costs in the presence of inexact updates and design and analysis of novel optimal methods for distributed consensus optimization. We provide a tutorial-style introduction to ALM and its variants for solving convex optimization problems in large-scale and distributed settings. We describe control-theoretic tools for the algorithms' analysis and design, survey recent results, and provide novel insights into the context of two emerging applications: federated learning and distributed energy trading.
BibTeX:
@article{Jakovetic2020a,
  author = {Dusan Jakovetic and Dragana Bajovic and Joao Xavier and Jose M. F. Moura},
  title = {Primal-Dual Methods for Large-Scale and Distributed Convex Optimization and Data Analytics},
  journal = {Proceedings of the IEEE},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--16},
  doi = {10.1109/jproc.2020.3007395}
}
Jalving J, Shin S and Zavala VM (2020), "A Graph-Based Modeling Abstraction for Optimization: Concepts and Implementation in Plasmo.jl", June, 2020.
Abstract: We present a general graph-based modeling abstraction for optimization that we call an OptiGraph. Under this abstraction, any optimization problem is treated as a hierarchical hypergraph in which nodes represent optimization subproblems and edges represent connectivity between such subproblems. The abstraction enables the modular construction of highly complex models in an intuitive manner, facilitates the use of graph analysis tools (to perform partitioning, aggregation, and visualization tasks), and facilitates communication of structures to decomposition algorithms. We provide an open-source implementation of the abstraction in the Julia-based package Plasmo.jl. We provide tutorial examples and large application case studies to illustrate the capabilities.
BibTeX:
@article{Jalving2020,
  author = {Jordan Jalving and Sungho Shin and Victor M. Zavala},
  title = {A Graph-Based Modeling Abstraction for Optimization: Concepts and Implementation in Plasmo.jl},
  year = {2020}
}
Jiang Z, Liu T, Zhang S, Guan Z, Yuan M and You H (2020), "Fast and Efficient Parallel Breadth-First Search with Power-law Graph Transformation", December, 2020.
Abstract: In the big data era, graph computing is widely used to exploit the hidden value in real-world graphs in various scenarios such as social networks, knowledge graphs, web searching, and recommendation systems. However, the random memory accesses result in inefficient use of cache and the irregular degree distribution leads to substantial load imbalance. Breadth-First Search (BFS) is frequently utilized as a kernel for many important and complex graph algorithms. In this paper, we describe a preprocessing approach using the Reverse Cuthill-McKee (RCM) algorithm to improve data locality and demonstrate how to achieve efficient load balancing for BFS. Computations on RCM-reordered graph data are also accelerated with SIMD executions. We evaluate the performance of the graph preprocessing approach on Kronecker graphs of the Graph500 benchmark and real-world graphs. Our BFS implementation on RCM-reordered graph data achieves 326.48 MTEPS/W (mega TEPS per watt) on an ARMv8 system, ranking 2nd on the Green Graph500 list in June 2020 (the 1st rank uses GPU acceleration).
BibTeX:
@article{Jiang2020,
  author = {Zite Jiang and Tao Liu and Shuai Zhang and Zhen Guan and Mengting Yuan and Haihang You},
  title = {Fast and Efficient Parallel Breadth-First Search with Power-law Graph Transformation},
  year = {2020}
}
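The locality effect of RCM reordering is easy to see on a small sparse matrix: the permuted matrix has a much smaller bandwidth, so neighbour accesses land in nearby memory. A MATLAB illustration on a built-in test matrix (the paper applies the same idea to graph adjacency structures before running BFS):
A = bucky;                                 % small symmetric sparse test adjacency matrix
p = symrcm(A);                             % Reverse Cuthill-McKee ordering
[i, j] = find(A);       bwOrig = max(abs(i - j));
[i, j] = find(A(p, p)); bwRcm  = max(abs(i - j));
fprintf('bandwidth: original %d, RCM-reordered %d\n', bwOrig, bwRcm);
figure;
subplot(1, 2, 1); spy(A);       title('original');
subplot(1, 2, 2); spy(A(p, p)); title('RCM-reordered');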
Jin Q and Mokhtari A (2020), "Non-asymptotic Superlinear Convergence of Standard Quasi-Newton Methods", March, 2020.
Abstract: In this paper, we study the non-asymptotic superlinear convergence rate of DFP and BFGS, which are two well-known quasi-Newton methods. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied, but their explicit finite time local convergence rate has not been established yet. In this paper, we provide a finite time (non-asymptotic) convergence analysis for BFGS and DFP methods under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous only in the direction of the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of O((1/k)^(k/2)), where k is the number of iterations. In particular, for a specific choice of the local neighborhood, both DFP and BFGS converge to the optimal solution at the rate of (0.85/k)^(k/2). Our theoretical guarantee is one of the first results that provide a non-asymptotic superlinear convergence rate for DFP and BFGS quasi-Newton methods.
BibTeX:
@article{Jin2020,
  author = {Qiujiang Jin and Aryan Mokhtari},
  title = {Non-asymptotic Superlinear Convergence of Standard Quasi-Newton Methods},
  year = {2020}
}
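As a reminder of what the paper's analysis is about, here is the textbook inverse-Hessian BFGS recursion in MATLAB, with a crude backtracking line search. The handles f and gradf and the starting point x0 are assumed to be given (hypothetical names); the paper studies the convergence rate of this kind of iteration, it does not propose the update itself.
H = eye(numel(x0));                      % inverse Hessian approximation
x = x0(:); g = gradf(x);
for k = 1:200
    p = -H * g;                          % quasi-Newton search direction
    t = 1;                               % backtracking from the unit step
    while f(x + t*p) > f(x) + 1e-4 * t * (g' * p)
        t = t / 2;
    end
    s = t * p;
    xNew = x + s;  gNew = gradf(xNew);
    yk = gNew - g;
    if yk' * s > 0                       % curvature condition keeps H positive definite
        rhok = 1 / (yk' * s);
        I = eye(numel(x));
        H = (I - rhok*(s*yk')) * H * (I - rhok*(yk*s')) + rhok*(s*s');   % BFGS update
    end
    x = xNew;  g = gNew;
    if norm(g) < 1e-8, break; end
end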
Jung YM, Whang JJ and Yun S (2020), "Sparse probabilistic K-means", Applied Mathematics and Computation., 10, 2020. Vol. 382, pp. 125328. Elsevier BV.
Abstract: The goal of clustering is to partition a set of data points into groups of similar data points, called clusters. Clustering algorithms can be classified into two categories: hard and soft clustering. Hard clustering assigns each data point to one cluster exclusively. On the other hand, soft clustering allows probabilistic assignments to clusters. In this paper, we propose a new model which combines the benefits of these two models: clarity of hard clustering and probabilistic assignments of soft clustering. Since the majority of data usually have a clear association, only a few points may require a probabilistic interpretation. Thus, we apply the l_1 norm constraint to impose sparsity on probabilistic assignments. Moreover, we also incorporate outlier detection in our clustering model to simultaneously detect outliers which can cause serious problems in statistical analyses. To optimize the model, we introduce an alternating minimization method and prove its convergence. Numerical experiments and comparisons with existing models show the soundness and effectiveness of the proposed model.
BibTeX:
@article{Jung2020,
  author = {Yoon Mo Jung and Joyce Jiyoung Whang and Sangwoon Yun},
  title = {Sparse probabilistic K-means},
  journal = {Applied Mathematics and Computation},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {382},
  pages = {125328},
  doi = {10.1016/j.amc.2020.125328}
}
Jyothi R and Babu P (2020), "PIANO: A Fast Parallel Iterative Algorithm for Multinomial and Sparse Multinomial Logistic Regression", February, 2020.
Abstract: Multinomial Logistic Regression is a well-studied tool for classification and has been widely used in fields like image processing, computer vision and bioinformatics, to name a few. Under a supervised classification scenario, a Multinomial Logistic Regression model learns a weight vector to differentiate between any two classes by optimizing over the likelihood objective. With the advent of big data, the inundation of data has resulted in large-dimensional weight vectors and has also given rise to a huge number of classes, which makes the classical methods for model estimation computationally unviable. To handle this issue, we propose a parallel iterative algorithm: Parallel Iterative Algorithm for MultiNomial LOgistic Regression (PIANO), which is based on the Majorization Minimization procedure and can update each element of the weight vectors in parallel. Further, we also show that PIANO can be easily extended to solve the Sparse Multinomial Logistic Regression problem - an extensively studied problem because of its attractive feature selection property. In particular, we work out the extension of PIANO to solve the Sparse Multinomial Logistic Regression problem with l1 and l0 regularizations. We also prove that PIANO converges to a stationary point of the Multinomial and the Sparse Multinomial Logistic Regression problems. Simulations were conducted to compare PIANO with the existing methods, and it was found that the proposed algorithm performs better than the existing methods in terms of speed of convergence.
BibTeX:
@article{Jyothi2020,
  author = {R. Jyothi and P. Babu},
  title = {PIANO: A Fast Parallel Iterative Algorithm for Multinomial and Sparse Multinomial Logistic Regression},
  year = {2020}
}
Kahl K and Lang B (2020), "On the equivalence of the Hermitian eigenvalue problem and hypergraph edge elimination", March, 2020.
Abstract: It is customary to identify sparse matrices with the corresponding adjacency or incidence graph. For the solution of linear systems of equations using Gaussian elimination, the representation by its adjacency graph allows a symbolic computation that can be used to predict memory footprints and enables the determination of near-optimal elimination orderings based on heuristics. The Hermitian eigenvalue problem on the other hand seems to evade such treatment at first glance due to its inherent iterative nature. In this paper we prove this assertion wrong by showing the equivalence of the Hermitian eigenvalue problem with a symbolic edge elimination procedure. A symbolic calculation based on the incidence graph of the matrix can be used in analogy to the symbolic phase of Gaussian elimination to develop heuristics which reduce memory footprint and computations. Yet, we also show that the question of an optimal elimination strategy remains NP-hard, in analogy to the linear systems case.
BibTeX:
@article{Kahl2020,
  author = {Karsten Kahl and Bruno Lang},
  title = {On the equivalence of the Hermitian eigenvalue problem and hypergraph edge elimination},
  year = {2020}
}
Kalantzis V (2020), "A Domain Decomposition Rayleigh--Ritz Algorithm for Symmetric Generalized Eigenvalue Problems", SIAM Journal on Scientific Computing., January, 2020. Vol. 42(6), pp. C410-C435. Society for Industrial & Applied Mathematics (SIAM).
Abstract: This paper proposes a parallel domain decomposition Rayleigh--Ritz projection scheme to compute a selected number of eigenvalues (and, optionally, associated eigenvectors) of large and sparse symmetric pencils. The projection subspace associated with interface variables is built by computing a few of the eigenvectors and associated leading derivatives of a zeroth-order approximation of the nonlinear matrix-valued interface operator. On the other hand, the projection subspace associated with interior variables is built independently in each subdomain by exploiting local eigenmodes and matrix resolvent approximations. The sought eigenpairs are then approximated by a Rayleigh--Ritz projection onto the subspace formed by the union of these two subspaces. Several theoretical and practical details are discussed, and upper bounds of the approximation errors are provided. Our numerical experiments demonstrate the efficiency of the proposed technique on sequential/distributed memory architectures as well as its competitiveness against schemes such as shift-and-invert Lanczos and automated multilevel substructuring combined with p-way vertex-based partitionings.
BibTeX:
@article{Kalantzis2020,
  author = {Vassilis Kalantzis},
  title = {A Domain Decomposition Rayleigh--Ritz Algorithm for Symmetric Generalized Eigenvalue Problems},
  journal = {SIAM Journal on Scientific Computing},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {42},
  number = {6},
  pages = {C410--C435},
  doi = {10.1137/19m1280004}
}
Kamzolov D, Gasnikov A and Dvurechensky P (2020), "On the Optimal Combination of Tensor Optimization Methods", February, 2020.
Abstract: We consider the minimization problem of a sum of a number of functions having Lipschitz p-th order derivatives with different Lipschitz constants. In this case, to accelerate optimization, we propose a general framework allowing to obtain near-optimal oracle complexity for each function in the sum separately, meaning, in particular, that the oracle for a function with lower Lipschitz constant is called a smaller number of times. As a building block, we extend the current theory of tensor methods and show how to generalize near-optimal tensor methods to work with an inexact tensor step. Further, we investigate the situation when the functions in the sum have Lipschitz derivatives of a different order. For this situation, we propose a generic way to separate the oracle complexity between the parts of the sum. Our method is not optimal, which leads to an open problem of the optimal combination of oracles of a different order.
BibTeX:
@article{Kamzolov2020,
  author = {Dmitry Kamzolov and Alexander Gasnikov and Pavel Dvurechensky},
  title = {On the Optimal Combination of Tensor Optimization Methods},
  year = {2020}
}
Kamzolov D and Gasnikov A (2020), "Near-Optimal Hyperfast Second-Order Method for convex optimization and its Sliding", February, 2020.
Abstract: In this paper, we present a new Hyperfast Second-Order Method with convergence rate O(N^-5) up to a logarithmic factor for convex functions with Lipschitz third derivative. This method is based on two ideas. The first comes from the superfast second-order scheme of Yu. Nesterov (CORE Discussion Paper 2020/07, 2020). It allows implementing the third-order scheme by solving the subproblem using only the second-order oracle. This method converges with rate O(N^-4). The second idea comes from the work of Kamzolov et al. (arXiv:2002.01004). It is the inexact near-optimal third-order method. In this work, we improve its convergence and merge it with the scheme of solving the subproblem using only the second-order oracle. As a result, we get a convergence rate of O(N^-5) up to a logarithmic factor. This convergence rate is near-optimal and the best known up to this moment. Further, we investigate the situation when there is a sum of two functions and improve the sliding framework from Kamzolov et al. (arXiv:2002.01004) for second-order methods.
BibTeX:
@article{Kamzolov2020a,
  author = {Dmitry Kamzolov and Alexander Gasnikov},
  title = {Near-Optimal Hyperfast Second-Order Method for convex optimization and its Sliding},
  year = {2020}
}
Kanagasabapathi S and Thushara M (2020), "Forward and Backward Static Analysis for Critical Numerical Accuracy in Floating Point Programs", Computer Science., 4, 2020. Vol. 21(2) AGH University of Science and Technology Press.
Abstract: In this article, we introduce a new static analysis for numerical accuracy. We address the problem of determining the minimal accuracy on the inputs and on the intermediary results of a program containing floating-point computations in order to ensure the desired accuracy of the outputs. The main approach is to combine a forward and backward static analysis, done by abstract interpretation. The backward analysis computes the minimal accuracy needed for the inputs and intermediary results of the program in order to ensure the desired accuracy of the results (as specified by the user). In practice, the information collected by our analysis may help optimize the formats used to represent the values stored in the variables of the program or to select the appropriate sensors. To illustrate our analysis, we have shown a prototype example with experimental results.
BibTeX:
@article{Kanagasabapathi2020,
  author = {Somasundaram Kanagasabapathi and MG Thushara},
  title = {Forward and Backward Static Analysis for Critical Numerical Accuracy in Floating Point Programs},
  journal = {Computer Science},
  publisher = {AGH University of Science and Technology Press},
  year = {2020},
  volume = {21},
  number = {2},
  doi = {10.7494/csci.2020.21.2.3421}
}
Kannan R, Sao P, Lu H, Herrmannova D, Thakkar V, Patton R, Vuduc R and Potok T (2020), "Scalable Knowledge Graph Analytics at 136 Petaflop/s", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 61-73. IEEE Computer Society.
Abstract: We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover how concepts relate to one another. They construct graph representations from annotated text databases and then formulate the relationship-mining problem as one of computing all-pairs shortest paths (APSP), which becomes a significant bottleneck. In this context, we present a new high-performance algorithm and implementation of the Floyd-Warshall algorithm for distributed-memory parallel computers accelerated by GPUs, which we call DSNAPSHOT (Distributed Accelerated Semiring All-Pairs Shortest Path). For our largest experiments, we ran DSNAPSHOT on a connected input graph with millions of vertices using 4,096 nodes (24,576 GPUs) of the Oak Ridge National Laboratory's Summit supercomputer system. We find DSNAPSHOT achieves a sustained performance of 136×10^15 floating-point operations per second (136 petaflop/s) at a parallel efficiency of 90% under weak scaling and, in absolute speed, 70% of the best possible performance given our computation (in the single-precision tropical semiring or "min-plus" algebra). Looking forward, we believe this novel capability will enable the mining of scholarly knowledge corpora when embedded and integrated into artificial intelligence-driven natural language processing workflows at scale.
BibTeX:
@inproceedings{Kannan2020,
  author = {R. Kannan and P. Sao and H. Lu and D. Herrmannova and V. Thakkar and R. Patton and R. Vuduc and T. Potok},
  title = {Scalable Knowledge Graph Analytics at 136 Petaflop/s},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {61--73},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00010},
  doi = {10.1109/SC41405.2020.00010}
}
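The min-plus (tropical) operation at the heart of the semiring APSP formulation is compact enough to show in full: the Floyd-Warshall recurrence for a small dense distance matrix D, where D(i,j) holds the edge weight, Inf marks a missing edge and the diagonal is zero (D assumed given). DSNAPSHOT distributes and GPU-accelerates this computation; the sketch below is serial and dense.
n = size(D, 1);
for k = 1:n
    % relax every pair (i, j) through intermediate vertex k (a rank-1 min-plus step)
    D = min(D, D(:, k) + D(k, :));
end
% D now holds all-pairs shortest path lengths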
Kardos J, Kourounis D and Schenk O (2020), "Reduced-Space Interior Point Methods in Power Grid Problems", January, 2020.
Abstract: Due to critical environmental issues, power systems have to accommodate a significant level of penetration of renewable generation, which requires smart approaches to power grid control. The associated optimal control problems are large-scale nonlinear optimization problems with up to hundreds of millions of variables and constraints. Interior point methods become computationally intractable, mainly due to the solution of large linear systems. This document addresses the computational bottlenecks of the interior point method during the solution of security constrained optimal power flow problems by applying a reduced-space quasi-Newton IPM, which can utilize high-performance computers due to the inherent parallelism in the adjoint method. The reduced-space IPM approach and the adjoint method, previously used in PDE-constrained optimization, are novel in the context of (security constrained) optimal power flow problems. The presented methodology is suitable for high-performance architectures due to the inherent parallelism in the adjoint method during the gradient evaluation, since the individual contingency scenarios are modeled by independent sets of constraints. A preliminary evaluation of the performance and convergence of the reduced-space approach is carried out.
BibTeX:
@article{Kardos2020,
  author = {Juraj Kardos and Drosos Kourounis and Olaf Schenk},
  title = {Reduced-Space Interior Point Methods in Power Grid Problems},
  year = {2020}
}
Karunakaran S and Selvaganesh L (2020), "A novel graph matrix representation: sequence of neighbourhood matrices with an application", SN Applied Sciences., 4, 2020. Vol. 2(5) Springer Science and Business Media LLC.
Abstract: In the study of network optimization, finding the shortest path minimizing time/distance/cost from a source node to a destination node is one of the fundamental problems. Our focus here is to find the shortest path between any pair of nodes in a given undirected unweighted simple graph with the help of the sequence of powers of neighbourhood matrices. The authors recently introduced the concept of the neighbourhood matrix as a novel representation of graphs using the neighbourhood sets of the vertices. In this article, an extension of the above work is presented by introducing a sequence of matrices, referred to as the sequence of powers of NM(G). It is denoted by NM^(l)(G) = [l_ij], 1 ≤ l ≤ k(G), where k(G) is called the iteration number, k(G) = ⌈log_2 diameter(G)⌉. As this sequence of matrices captures the distance between the nodes profoundly, we further develop the technique and present several characterizations. Based on the theoretical results, we present an algorithm to find the shortest path between any pair of nodes in a given graph. The proposed algorithm and the claims therein are formally validated through simulations on synthetic data and real network data from Facebook. The empirical results are quite promising, with our algorithm having the best running time among all the existing well-known shortest path algorithms for the considered graph classes.
BibTeX:
@article{Karunakaran2020,
  author = {Sivakumar Karunakaran and Lavanya Selvaganesh},
  title = {A novel graph matrix representation: sequence of neighbourhood matrices with an application},
  journal = {SN Applied Sciences},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {2},
  number = {5},
  doi = {10.1007/s42452-020-2635-1}
}
Kedward L, Allen CB and Rendall T (2020), "Comparing Matrix-based and Matrix-free Discrete Adjoint Approaches to the Euler Equations", In AIAA Scitech 2020 Forum., 1, 2020. American Institute of Aeronautics and Astronautics.
Abstract: Detail is presented on the implementation of numerical derivatives with focus given to the discrete adjoint equations. Two approaches are considered: a hybrid matrix-based scheme where the convective Jacobian is constructed explicitly; and a matrix-free method using reverse-mode automatic differentiation. The hybrid matrix-based scheme exploits a compact convective stencil using graph colouring to evaluate the convective Jacobian terms in O(10) residual evaluations. Jacobian terms, grouped by colours, are evaluated using the complex step tangent model; this approach requires no external libraries or tools, minimal code modification and provides derivatives accurate to machine precision. The remaining artificial dissipation terms are trivial to differentiate by hand where the sensor coefficients are held constant. The hybrid matrix-based methodology is validated and compared with the 'traditional' matrix-free approach using reverse-mode automatic differentiation. The adjoint equations using both approaches are solved using the same fixed-point Runge-Kutta iteration accelerated by agglomeration multigrid. No loss in accuracy is seen between the matrix-based and the matrix-free methods when validated with the complex step tangent model. The hybrid matrix-based approach demonstrates a notable runtime performance advantage over the traditional matrix-free approach due to the prior calculation of Jacobian terms. Moreover, the convective Jacobian calculation takes less than 5% of primal runtime due to the compact stencil used. A critical analysis of the results and methodology is consequently presented, focusing on the general applicability of the hybrid approach to more complex problems.
BibTeX:
@inproceedings{Kedward2020,
  author = {Laurence Kedward and Christian B. Allen and T. Rendall},
  title = {Comparing Matrix-based and Matrix-free Discrete Adjoint Approaches to the Euler Equations},
  booktitle = {AIAA Scitech 2020 Forum},
  publisher = {American Institute of Aeronautics and Astronautics},
  year = {2020},
  doi = {10.2514/6.2020-1294}
}
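The complex-step trick used for the convective Jacobian terms is easy to demonstrate in isolation. Here is a small sketch on a toy function of my own choosing (not the paper's residuals) showing why it reaches machine precision where finite differences suffer cancellation.
% Complex-step derivative: imag(f(x + i*h))/h approximates f'(x) to machine
% precision for real-analytic f, because no subtraction of nearby values occurs.
f = @(x) exp(sin(x));                        % smooth toy function (assumption)
x = 0.7;
d_exact = cos(x) * exp(sin(x));              % analytic derivative for reference
d_cs = imag(f(x + 1i*1e-20)) / 1e-20;        % complex step, h can be tiny
d_fd = (f(x + 1e-8) - f(x)) / 1e-8;          % forward difference, for contrast
fprintf('complex-step error %.1e, finite-difference error %.1e\n', ...
    abs(d_cs - d_exact), abs(d_fd - d_exact));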
Kepner J, Meiners C, Byun C, McGuire S, Davis T, Arcand W, Bernays J, Bestor D, Bergeron W, Gadepally V, Harnasch R, Hubbell M, Houle M, Jones M, Kirby A, Klein A, Milechin L, Mullen J, Prout A, Reuther A, Rosa A, Samsi S, Stetson D, Tse A, Yee C and Michaleas P (2020), "Multi-Temporal Analysis and Scaling Relations of 100,000,000,000 Network Packets", August, 2020.
Abstract: Our society has never been more dependent on computer networks. Effective utilization of networks requires a detailed understanding of the normal background behaviors of network traffic. Large-scale measurements of networks are computationally challenging. Building on prior work in interactive supercomputing and GraphBLAS hypersparse hierarchical traffic matrices, we have developed an efficient method for computing a wide variety of streaming network quantities on diverse time scales. Applying these methods to 100,000,000,000 anonymized source-destination pairs collected at a network gateway reveals many previously unobserved scaling relationships. These observations provide new insights into normal network background traffic that could be used for anomaly detection, AI feature engineering, and testing theoretical models of streaming networks.
BibTeX:
@article{Kepner2020,
  author = {Jeremy Kepner and Chad Meiners and Chansup Byun and Sarah McGuire and Timothy Davis and William Arcand and Jonathan Bernays and David Bestor and William Bergeron and Vijay Gadepally and Raul Harnasch and Matthew Hubbell and Micheal Houle and Micheal Jones and Andrew Kirby and Anna Klein and Lauren Milechin and Julie Mullen and Andrew Prout and Albert Reuther and Antonio Rosa and Siddharth Samsi and Doug Stetson and Adam Tse and Charles Yee and Peter Michaleas},
  title = {Multi-Temporal Analysis and Scaling Relations of 100,000,000,000 Network Packets},
  year = {2020}
}
Keriven N and Vaiter S (2020), "Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model", February, 2020.
Abstract: In this paper, we analyse classical variants of the Spectral Clustering (SC) algorithm in the Dynamic Stochastic Block Model (DSBM). Existing results show that, in the relatively sparse case where the expected degree grows logarithmically with the number of nodes, guarantees in the static case can be extended to the dynamic case and yield improved error bounds when the DSBM is sufficiently smooth in time, that is, the communities do not change too much between two time steps. We improve over these results by drawing a new link between the sparsity and the smoothness of the DSBM: the more regular the DSBM is, the more sparse it can be, while still guaranteeing consistent recovery. In particular, a mild condition on the smoothness allows us to treat the sparse case with bounded degree. We also extend these guarantees to the normalized Laplacian, and as a by-product of our analysis, we obtain, to our knowledge, the best spectral concentration bound available for the normalized Laplacian of matrices with independent Bernoulli entries.
BibTeX:
@article{Keriven2020,
  author = {Nicolas Keriven and Samuel Vaiter},
  title = {Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model},
  year = {2020}
}
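For context, the static spectral clustering baseline that this paper extends looks roughly as follows. This is a minimal sketch on a two-block SBM, assuming the Statistics toolbox for kmeans; the dynamic, smoothed estimators of the paper are not reproduced.
% Static spectral clustering on a toy two-community SBM with the normalized Laplacian.
rng('default');
n = 100;
z = [ones(n/2, 1); 2 * ones(n/2, 1)];           % ground-truth communities
B = [0.20 0.02; 0.02 0.20];                     % connection probabilities
P = B(z, z);
A = triu(rand(n) < P, 1); A = A + A';           % sample a symmetric adjacency
d = sum(A, 2); d(d == 0) = 1;
L = eye(n) - diag(d.^-0.5) * A * diag(d.^-0.5); % normalized Laplacian
[V, ~] = eigs(sparse(L), 2, 'smallestabs');     % two smallest eigenvectors
labels = kmeans(V, 2, 'Replicates', 5);         % cluster the spectral embedding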
Keyes DE, Ltaief H and Turkiyyah G (2020), "Hierarchical algorithms on hierarchical architectures", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190055. The Royal Society.
Abstract: A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high-computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architecture possesses hierarchies of its own that do not generally align with those of the algorithm. We describe modules of a software toolkit, hierarchical computations on manycore architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors.
BibTeX:
@article{Keyes2020,
  author = {D. E. Keyes and H. Ltaief and G. Turkiyyah},
  title = {Hierarchical algorithms on hierarchical architectures},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190055},
  doi = {10.1098/rsta.2019.0055}
}
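The core observation behind hierarchical and tile low-rank formats is that off-diagonal blocks of many formally dense operators have rapidly decaying singular values. A small sketch with a toy kernel block follows (my own example, not the API of the toolkit described in the paper).
% Compress an off-diagonal kernel block between two well-separated point sets
% with a truncated SVD; the numerical rank is tiny compared to the block size.
x = linspace(0, 1, 200)'; y = linspace(2, 3, 200)';
K = 1 ./ abs(x - y');                            % 200 x 200 off-diagonal block
[U, S, V] = svd(K);
r = find(diag(S) / S(1, 1) > 1e-8, 1, 'last');   % numerical rank at tol 1e-8
Kr = U(:, 1:r) * S(1:r, 1:r) * V(:, 1:r)';       % low-rank representation
fprintf('rank %d of %d, relative error %.1e\n', r, size(K, 1), ...
    norm(K - Kr) / norm(K));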
Khalifa DB, Martel M and Adjé A (2020), "POP: A Tuning Assistant for Mixed-Precision Floating-Point Computations", In Communications in Computer and Information Science. , pp. 77-94. Springer International Publishing.
Abstract: In this article, we describe a static program analysis to determine the lowest floating-point precisions on inputs and intermediate results that guarantee a desired accuracy of the output values. A common practice used by developers without advanced training in computer arithmetic consists in using the highest precision available in hardware (double precision on most CPUs), which can be exorbitant in terms of energy consumption, memory traffic, and bandwidth capacity. To overcome this difficulty, we propose a new precision tuning tool for floating-point programs integrating a static forward and backward analysis, done by abstract interpretation. The analysis is then expressed as a set of linear constraints easily checked by an SMT solver.
BibTeX:
@incollection{Khalifa2020,
  author = {Dorra Ben Khalifa and Matthieu Martel and Assalé Adjé},
  title = {POP: A Tuning Assistant for Mixed-Precision Floating-Point Computations},
  booktitle = {Communications in Computer and Information Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {77--94},
  doi = {10.1007/978-3-030-46902-3_5}
}
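A trivial illustration of the trade-off such precision-tuning tools navigate (my own toy experiment, not POP's analysis): the same reduction computed in single and double precision.
% Accumulating a long sum in single precision loses digits relative to double;
% a tuning tool decides where such losses are acceptable for the final output.
rng(0);
x = rand(1e6, 1);
s_double = sum(x);
s_single = sum(single(x));
fprintf('relative difference single vs double: %.1e\n', ...
    abs(double(s_single) - s_double) / s_double);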
Khimich OM, Popov OV, Chistyakov OV and Sidoruk VA (2020), "A Parallel Algorithm for Solving a Partial Eigenvalue Problem for Block-Diagonal Bordered Matrices", Cybernetics and Systems Analysis., November, 2020. Vol. 56(6), pp. 913-923. Springer Science and Business Media LLC.
Abstract: A hybrid subspace-iteration algorithm for the partial generalized eigenvalue problem for symmetric positive definite sparse matrices of block-diagonal structure with bordering is proposed for hybrid computers with graphics processors; efficiency coefficients of the algorithm are obtained, and the algorithm is tested on benchmark and practical problems.
BibTeX:
@article{Khimich2020,
  author = {O. M. Khimich and O. V. Popov and O. V. Chistyakov and V. A. Sidoruk},
  title = {A Parallel Algorithm for Solving a Partial Eigenvalue Problem for Block-Diagonal Bordered Matrices},
  journal = {Cybernetics and Systems Analysis},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  volume = {56},
  number = {6},
  pages = {913--923},
  doi = {10.1007/s10559-020-00311-z}
}
Kim H, Zeng J, Liu Q, Abdel-Majeed M, Lee J and Jung C (2020), "Compiler-directed soft error resilience for lightweight GPU register file protection", In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation., 6, 2020. ACM.
Abstract: This paper presents Penny, a compiler-directed resilience scheme for protecting GPU register files (RF) against soft errors. Penny replaces the conventional error correction code (ECC) based RF protection by using less expensive error detection code (EDC) along with idempotence based recovery. Compared to the ECC protection, Penny can achieve either the same level of RF resilience yet with significantly lower hardware costs or stronger resilience using the same ECC due to its ability to detect multi-bit errors when it is used solely for detection. In particular, to address the lack of store buffers in GPUs, which causes both checkpoint storage overwriting and the high cost of checkpointing stores, Penny provides several compiler optimizations such as storage coloring and checkpoint pruning. Across 25 benchmarks, Penny causes only ≈3% run-time overhead on average.
BibTeX:
@inproceedings{Kim2020,
  author = {Hongjune Kim and Jianping Zeng and Qingrui Liu and Mohammad Abdel-Majeed and Jaejin Lee and Changhee Jung},
  title = {Compiler-directed soft error resilience for lightweight GPU register file protection},
  booktitle = {Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3385412.3386033}
}
Kirk RO, Nolten M, Kevis R, Law TR, Maheswaran S, Wright SA, Powell S, Mudalige GR and Jarvis SA (2020), "Warwick Data Store: A Data Structure Abstraction Library", In Proceedings of the 2020 IEEE/ACM Conference on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems.
Abstract: With the increasing complexity of memory architectures and scientific applications, developing data structures that are performant, portable, scalable, and support developer productivity is a challenging task. In this paper, we present Warwick Data Store (WDS), a lightweight and extensible C++ template library designed to manage these complexities and allow rapid prototyping. WDS is designed to abstract details of the underlying data structures away from the user, thus easing application development and optimisation. We show that using WDS does not significantly impact achieved performance across a variety of different scientific benchmarks and proxy-applications, compilers, and different architectures. The overheads are largely below 30% for smaller problems, with the overhead decreasing to below 10% when using larger problems. This shows that the library does not significantly impact the performance, while providing additional functionality to data structures, and the ability to optimise data structures without changing the application code.
BibTeX:
@inproceedings{Kirk2020,
  author = {Richard O. Kirk and Martin Nolten and Robert Kevis and Timothy R. Law and Satheesh Maheswaran and Steven A. Wright and Seimon Powell and Gihan R. Mudalige and Stephen A. Jarvis},
  title = {Warwick Data Store: A Data Structure Abstraction Library},
  booktitle = {Proceedings of the 2020 IEEE/ACM Conference on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems},
  year = {2020}
}
Klein O (2020), "An Improved Conjugate Gradients Method for Quasi-linear Bayesian Inverse Problems, Tested on an Example from Hydrogeology", In Modeling, Simulation and Optimization of Complex Processes HPSC 2018., December, 2020. , pp. 357-385. Springer International Publishing.
Abstract: We present a framework for high-performance quasi-linear Bayesian inverse modelling and its application in hydrogeology; extensions to other domains of application are straightforward due to generic programming and modular design choices. The central component of the framework is a collection of specialized preconditioned methods for nonlinear least squares: the classical three-term recurrence relation of Conjugate Gradients and related methods is replaced by a specific choice of six-term recurrence relation, which is used to reformulate the resulting optimization problem and eliminate several costly matrix-vector products. We demonstrate that this reformulation leads to improved performance, robustness, and accuracy for a synthetic example application from hydrogeology. The proposed prior-preconditioned caching CG scheme is the only one among the considered CG methods that scales perfectly in the number of estimated parameters. In the highly relevant case of sparse measurements, the proposed method is up to two orders of magnitude faster than the classical CG scheme, and at least six times faster than a prior-preconditioned, non-caching version. It is therefore particularly suited for the large-scale inversion of sparse observations.
BibTeX:
@incollection{Klein2020,
  author = {Ole Klein},
  title = {An Improved Conjugate Gradients Method for Quasi-linear Bayesian Inverse Problems, Tested on an Example from Hydrogeology},
  booktitle = {Modeling, Simulation and Optimization of Complex Processes HPSC 2018},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {357--385},
  doi = {10.1007/978-3-030-55240-4_17}
}
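For reference, the textbook preconditioned CG iteration that the paper reorganises (its six-term, caching variant is not reproduced here) looks like this; the Poisson test matrix and Jacobi preconditioner are my own choices for the sketch.
% Plain preconditioned conjugate gradients on an SPD test matrix.
A = gallery('poisson', 30);                  % SPD matrix, 900 x 900
n = size(A, 1); b = ones(n, 1);
M = spdiags(diag(A), 0, n, n);               % Jacobi preconditioner (assumption)
x = zeros(n, 1); r = b - A * x; z = M \ r; p = z;
for it = 1:500
    Ap = A * p;
    alpha = (r' * z) / (p' * Ap);
    x = x + alpha * p;
    r_new = r - alpha * Ap;
    if norm(r_new) < 1e-10 * norm(b), break; end
    z_new = M \ r_new;
    beta = (r_new' * z_new) / (r' * z);
    p = z_new + beta * p;
    r = r_new; z = z_new;
end
fprintf('PCG stopped after %d iterations, residual %.1e\n', it, norm(b - A * x));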
Klockiewicz B, Cambier L, Humble R, Tchelepi H and Darve E (2020), "Second Order Accurate Hierarchical Approximate Factorization of Sparse SPD Matrices", July, 2020.
Abstract: We describe a second-order accurate approach to sparsifying the off-diagonal blocks in approximate hierarchical matrix factorizations of sparse symmetric positive definite matrices. The norm of the error made by the new approach depends quadratically, not linearly, on the error in the low-rank approximation of the given block. The analysis of the resulting two-level preconditioner shows that the preconditioner is second-order accurate as well. We incorporate the new approach into the recent Sparsified Nested Dissection algorithm [SIAM J. Matrix Anal. Appl., 41 (2020), pp. 715-746], and test it on a wide range of problems. The new approach halves the number of Conjugate Gradient iterations needed for convergence, with almost the same factorization complexity, improving the total runtimes of the algorithm. Our approach can be incorporated into other rank-structured methods for solving sparse linear systems.
BibTeX:
@article{Klockiewicz2020,
  author = {Bazyli Klockiewicz and Léopold Cambier and Ryan Humble and Hamdi Tchelepi and Eric Darve},
  title = {Second Order Accurate Hierarchical Approximate Factorization of Sparse SPD Matrices},
  year = {2020}
}
Kochurov M, Karimov R and Kozlukov S (2020), "Geoopt: Riemannian Optimization in PyTorch", May, 2020.
Abstract: Geoopt is a research-oriented modular open-source package for Riemannian Optimization in PyTorch. The core of Geoopt is a standard Manifold interface which allows for the generic implementation of optimization algorithms. Geoopt supports basic Riemannian SGD as well as adaptive optimization algorithms. Geoopt also provides several algorithms and arithmetic methods for supported manifolds, which allow composing geometry-aware neural network layers that can be integrated with existing models.
BibTeX:
@article{Kochurov2020,
  author = {Max Kochurov and Rasul Karimov and Sergei Kozlukov},
  title = {Geoopt: Riemannian Optimization in PyTorch},
  year = {2020}
}
Kolodziej SP and Davis TA (2020), "Generalized Gains for Hybrid Vertex Separator Algorithms", In Proceedings of the SIAM Workshop on Combinatorial Scientific Computing., 1, 2020. , pp. 96-105. Society for Industrial and Applied Mathematics.
Abstract: In this paper, we derive generalized vertex gains for computing vertex separators, greatly improving the efficiency and data reuse in hybrid graph partitioning contexts. Using these generalized gains, we design a novel algorithm for computing vertex separators in arbitrary graphs and compare our approach to METIS, a popular graph partitioning library. In general, our hybrid algorithm scales well to very large graphs with the increased information sharing that the generalized gains afford.
BibTeX:
@incollection{Kolodziej2020,
  author = {Scott P. Kolodziej and Timothy A. Davis},
  title = {Generalized Gains for Hybrid Vertex Separator Algorithms},
  booktitle = {Proceedings of the SIAM Workshop on Combinatorial Scientific Computing},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  pages = {96--105},
  doi = {10.1137/1.9781611976229.10}
}
Kornowski G and Shamir O (2020), "High-Order Oracle Complexity of Smooth and Strongly Convex Optimization", October, 2020.
Abstract: In this note, we consider the complexity of optimizing a highly smooth (Lipschitz k-th order derivative) and strongly convex function, via calls to a k-th order oracle which returns the value and first k derivatives of the function at a given point, and where the dimension is unrestricted. Extending the techniques introduced in Arjevani et al. [2019], we prove that the worst-case oracle complexity for any fixed k to optimize the function up to accuracy ε is on the order of (μ_k D^{k-1}/λ)^{2/(3k+1)} + log log(1/ε) (up to log factors independent of ε), where μ_k is the Lipschitz constant of the k-th derivative, D is the initial distance to the optimum, and λ is the strong convexity parameter.
BibTeX:
@article{Kornowski2020,
  author = {Guy Kornowski and Ohad Shamir},
  title = {High-Order Oracle Complexity of Smooth and Strongly Convex Optimization},
  year = {2020}
}
Kressner D, Lund K, Massei S and Palitta D (2020), "Compress-and-restart block Krylov subspace methods for Sylvester matrix equations", February, 2020.
Abstract: Block Krylov subspace methods (KSMs) comprise building blocks in many state-of-the-art solvers for large-scale matrix equations as they arise, e.g., from the discretization of partial differential equations. While extended and rational block Krylov subspace methods provide a major reduction in iteration counts over polynomial block KSMs, they also require reliable solvers for the coefficient matrices, and these solvers are often iterative methods themselves. It is not hard to devise scenarios in which the available memory, and consequently the dimension of the Krylov subspace, is limited. In such scenarios for linear systems and eigenvalue problems, restarting is a well explored technique for mitigating memory constraints. In this work, such restarting techniques are applied to polynomial KSMs for matrix equations with a compression step to control the growing rank of the residual. An error analysis is also performed, leading to heuristics for dynamically adjusting the basis size in each restart cycle. A panel of numerical experiments demonstrates the effectiveness of the new method with respect to extended block KSMs.
BibTeX:
@article{Kressner2020,
  author = {Daniel Kressner and Kathryn Lund and Stefano Massei and Davide Palitta},
  title = {Compress-and-restart block Krylov subspace methods for Sylvester matrix equations},
  year = {2020}
}
Kreutzer P, Kraus S and Philippsen M (2020), "Language-Agnostic Generation of Compilable Test Programs", In Proceedings of the 13th IEEE International Conference on Software Testing, Validation and Verification., 10, 2020. IEEE.
Abstract: Testing is an integral part of the development of compilers and other language processors. To automatically create large sets of test programs, random program generators, or fuzzers, have emerged. Unfortunately, existing approaches are either language-specific (and thus require a rewrite for each language) or may generate programs that violate rules of the respective programming language (which limits their usefulness). This work introduces *Smith, a language-agnostic framework for the generation of valid, compilable test programs. It takes as input an abstract attribute grammar that specifies the syntactic and semantic rules of a programming language. It then creates test programs that satisfy all these rules. By aggressively pruning the search space and keeping the construction as local as possible, *Smith can generate huge, complex test programs in short time. We present four case studies covering four real-world programming languages (C, Lua, SQL, and SMT-LIB 2) to show that *Smith is both efficient and effective, while being flexible enough to support programming languages that differ considerably. We found bugs in all four case studies. For example, *Smith detected 165 different crashes in older versions of GCC and LLVM. *Smith and the language grammars are available online.
BibTeX:
@inproceedings{Kreutzer2020,
  author = {Patrick Kreutzer and Stefan Kraus and Michael Philippsen},
  title = {Language-Agnostic Generation of Compilable Test Programs},
  booktitle = {Proceedings of the 13th IEEE International Conference on Software Testing, Validation and Verification},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/icst46399.2020.00015}
}
Kronqvist J and Misener R (2020), "A disjunctive cut strengthening technique for convex MINLP", Optimization and Engineering., 8, 2020. Springer Science and Business Media LLC.
Abstract: Generating polyhedral outer approximations and solving mixed-integer linear relaxations remains one of the main approaches for solving convex mixed-integer nonlinear programming (MINLP) problems. There are several algorithms based on this concept, and the efficiency is greatly affected by the tightness of the outer approximation. In this paper, we present a new framework for strengthening cutting planes of nonlinear convex constraints, to obtain tighter outer approximations. The strengthened cuts can give a tighter continuous relaxation and an overall tighter representation of the nonlinear constraints. The cuts are strengthened by analyzing disjunctive structures in the MINLP problem, and we present two types of strengthened cuts. The first type of cut is obtained by reducing the right-hand side value of the original cut, such that it forms the tightest generally valid inequality for a chosen disjunction. The second type of cut effectively uses individual right-hand side values for each term of the disjunction. We prove that both types of cuts are valid and that the second type of cut can dominate both the first type and the original cut. We use the cut strengthening in conjunction with the extended supporting hyperplane algorithm, and numerical results show that the strengthening can significantly reduce both the number of iterations and the time needed to solve convex MINLP problems.
BibTeX:
@article{Kronqvist2020,
  author = {Jan Kronqvist and Ruth Misener},
  title = {A disjunctive cut strengthening technique for convex MINLP},
  journal = {Optimization and Engineering},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s11081-020-09551-6}
}
Kurt S, Sukumaran-Rajam A, Rastello F and Sadayappan P (2020), "Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 1234-1247. IEEE Computer Society.
Abstract: Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
BibTeX:
@inproceedings{Kurt2020,
  author = {S. Kurt and A. Sukumaran-Rajam and F. Rastello and P. Sadayappan},
  title = {Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {1234--1247},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00091},
  doi = {10.1109/SC41405.2020.00091}
}
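SDDMM is less widely known than SpMM, so here is a minimal MATLAB definition of the kernel (my own formulation, unrelated to the paper's tiled implementation): only the entries of A*B at the nonzero positions of a sparse sampling matrix S are computed.
% Sampled Dense-Dense Matrix Multiplication: (A*B) restricted to the pattern
% of S (the values of S are ignored in this sketch).
m = 1000; k = 64; n = 1000;
A = rand(m, k); B = rand(k, n);
S = sprand(m, n, 1e-3);                       % sparsity pattern to sample
[i, j] = find(S);
vals = sum(A(i, :) .* B(:, j)', 2);           % one k-length dot product per nonzero
C = sparse(i, j, vals, m, n);                 % result has the pattern of S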
Kwasniewski G, Ben-Nun T, Ziogas AN, Schneider T, Besta M and Hoefler T (2020), "On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization", October, 2020.
Abstract: Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorization, we derive COnfLUX, an LU algorithm with the parallel I/O cost of N^3 / (P √M) communicated elements per processor -- only 1/3× over our established lower bound. We evaluate COnfLUX on various problem sizes, demonstrating empirical results that match our theoretical analysis, communicating asymptotically less than Cray ScaLAPACK or SLATE, and outperforming the asymptotically-optimal CANDMC library. Running on 1,024 nodes of Piz Daint, COnfLUX communicates 1.6× less than the second-best implementation and is expected to communicate 2.1× less on a full-scale run on Summit.
BibTeX:
@article{Kwasniewski2020,
  author = {Grzegorz Kwasniewski and Tal Ben-Nun and Alexandros Nikolaos Ziogas and Timo Schneider and Maciej Besta and Torsten Hoefler},
  title = {On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization},
  year = {2020}
}
Laguna I (2020), "Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing", In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)., 5, 2020. IEEE.
Abstract: Floating-point arithmetic can be confusing and it is sometimes misunderstood by programmers. While numerical reproducibility is desirable in HPC, it is often unachievable due to the different ways compilers treat floating-point arithmetic and generate code around it. This reproducibility problem is exacerbated in heterogeneous HPC systems where code can be executed on different floating-point hardware, e.g., a host and a device architecture, producing in some situations different numerical results. We present VARITY, a tool to quantify floating-point variations in heterogeneous HPC systems. Our approach generates random test programs for multiple architectures (host and device) using the compilers that are available in the system. Using differential testing, it compares floating-point results and identifies unexpected variations in the program results. The results can guide programmers in choosing the compilers that produce the most similar results in a system, which is useful when numerical reproducibility is critical. By running 50,000 experiments with Varity on a system with IBM POWER9 CPUs, NVIDIA V100 GPUs, and four compilers (gcc, clang, xl, and nvcc), we identify and document several programs that produce significantly different results for a given input when different compilers or architectures are used, even when a similar optimization level is used everywhere.
BibTeX:
@inproceedings{Laguna2020,
  author = {Ignacio Laguna},
  title = {Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing},
  booktitle = {2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ipdps47924.2020.00070}
}
Lai M-J, Xie J and Xu Z (2020), "Graph Sparsification by Universal Greedy Algorithms", July, 2020.
Abstract: Graph sparsification is to approximate an arbitrary graph by a sparse graph and is useful in many applications, such as simplification of social networks, least squares problems, and the numerical solution of symmetric positive definite linear systems. In this paper, inspired by the well-known sparse signal recovery algorithm called orthogonal matching pursuit (OMP), we introduce a deterministic, greedy edge selection algorithm called the universal greedy algorithm (UGA) for graph sparsification. The UGA algorithm can output a (1+ε)^2/(1-ε)^2-spectral sparsifier with ⌈n/ε^2⌉ edges in O(m + n^2/ε^2) time for a general random graph with m edges and n vertices satisfying a mild sufficient condition. This is a linear time algorithm in terms of the number of edges that the community of graph sparsification is looking for. The best result in the literature, to the knowledge of the authors, is the existence of a deterministic algorithm which is almost linear, i.e. O(m^{1+o(1)}) for some o(1) = O((log log(m))^{2/3} / log^{1/3}(m)). We point out that several random graphs satisfy the sufficient condition and hence can be sparsified in linear time. For a general spectral sparsification problem, e.g., the positive subset selection problem, a nonnegative UGA algorithm is proposed which needs O(mn^2 + n^3/ε^2) time, and its convergence is established.
BibTeX:
@article{Lai2020,
  author = {Ming-Jun Lai and Jiaxin Xie and Zhiqiang Xu},
  title = {Graph Sparsification by Universal Greedy Algorithms},
  year = {2020}
}
Lambert M (2020), "Analysis of High Performance Sparse Matrix-Vector Multiplication for Small Finite Fields". Thesis at: University of Delaware.
Abstract: This thesis explores the intricacies of obtaining high performance sparse matrix-vector multiplication on modern hardware, with an emphasis on operating over data from small finite fields. We develop and present novel adaptations of classical, fast matrix multiplication algorithms over these fields and apply them in a sparse setting. Further, we compare these novel formats and algorithms to a wide array of standard sparse matrix formats. In particular, we use modern code analysis tools to show how data layouts, vector/panel widths, and choice of compiler all substantially influence the efficiency of the underlying arithmetic of these various sparse formats. These analyses are performed in a data-agnostic manner, meaning we focus solely on the efficiency of the arithmetic and not on higher-level aspects of performance such as cache access patterns. In particular, we analyze the assembly code produced by compilers, removing matrix-specific intangibles from the discussion of format. These are still important considerations when considering any specific matrix, but they make comparing general formats to one another difficult without exhaustive benchmarking. These results show the theoretical peak arithmetic performance, which we discuss from this abstract, analytic perspective. We see similar trends in synthetic performance benchmarks. Ultimately, we show that the Method of the Four Russians can be directly adapted to sparse matrix-panel (a matrix with relatively few columns) multiplication and that a custom, high-performance variant can achieve high performance in sparse matrix-vector multiplication, with potential to perform better in real-world matrices.
BibTeX:
@phdthesis{Lambert2020,
  author = {Lambert, Matthew},
  title = {Analysis of High Performance Sparse Matrix-Vector Multiplication for Small Finite Fields},
  school = {University of Delaware},
  year = {2020}
}
Lee H, Wong D, Hoang L, Dathathri R, Gill G, Jatala V, Kuck D and Pingali K (2020), "A Study of APIs for Graph Analytics Workloads", In Proceedings of the 2020 IEEE International Symposium on Workload Characterization.
Abstract: Traditionally, parallel graph analytics workloads have been implemented in systems like Pregel, GraphLab, Galois, and Ligra that support graph data structures and graph operations directly. An alternative approach is to express graph workloads in terms of sparse matrix kernels such as sparse matrix-vector and matrix-matrix multiplication. An API for these kernels has been defined by the GraphBLAS project. The SuiteSparse project has implemented this API on shared-memory platforms, and the LAGraph project is building a library of graph algorithms using this API. How does the matrix-based approach perform compared to the graph-based approach? Our experiments on a 56-core CPU show that for representative graph workloads, LAGraph/SuiteSparse solutions are 5× slower on the average than Galois solutions. We argue that this performance gap arises from inherent limitations of a matrix-based API: regardless of which architecture a matrix-based algorithm is run on, it is subject to the same inherent limitations of the matrix-based API.
BibTeX:
@inproceedings{Lee2020,
  author = {Hochan Lee and David Wong and Loc Hoang and Roshan Dathathri and Gurbinder Gill and Vishwesh Jatala and David Kuck and Keshav Pingali},
  title = {A Study of APIs for Graph Analytics Workloads},
  booktitle = {Proceedings of the 2020 IEEE International Symposium on Workload Characterization},
  year = {2020}
}
Legat B, Dowson O, Garcia JD and Lubin M (2020), "MathOptInterface: a data structure for mathematical optimization problems", February, 2020.
Abstract: JuMP is an open-source algebraic modeling language in the Julia language. In this work, we discuss a complete re-write of JuMP based on a novel abstract data structure, which we call MathOptInterface, for representing instances of mathematical optimization problems. MathOptInterface is significantly more general than existing data structures in the literature, encompassing, for example, a spectrum of problems classes from integer programming with indicator constraints to bilinear semidefinite programming. We highlight the challenges that arise from this generality, and how we overcame them in the re-write of JuMP.
BibTeX:
@article{Legat2020,
  author = {Benoit Legat and Oscar Dowson and Joaquim Dias Garcia and Miles Lubin},
  title = {MathOptInterface: a data structure for mathematical optimization problems},
  year = {2020}
}
Leleux P, Courtain S, Guex G and Saerens M (2020), "Sparse Randomized Shortest Paths Routing with Tsallis Divergence Regularization", July, 2020.
Abstract: This work elaborates on the important problem of (1) designing optimal randomized routing policies for reaching a target node t from a source node s on a weighted directed graph G and (2) defining distance measures between nodes interpolating between the least cost (based on optimal movements) and the commute-cost (based on a random walk on G), depending on a temperature parameter T. To this end, the randomized shortest path formalism (RSP, [2,99,124]) is rephrased in terms of Tsallis divergence regularization, instead of Kullback-Leibler divergence. The main consequence of this change is that the resulting routing policy (local transition probabilities) becomes sparser when T decreases, therefore inducing a sparse random walk on G converging to the least-cost directed acyclic graph when T tends to 0. Experimental comparisons on node clustering and semi-supervised classification tasks show that the derived dissimilarity measures based on expected routing costs provide state-of-the-art results. The sparse RSP is therefore a promising model of movements on a graph, balancing sparse exploitation and exploration in an optimal way.
BibTeX:
@article{Leleux2020,
  author = {Pierre Leleux and Sylvain Courtain and Guillaume Guex and Marco Saerens},
  title = {Sparse Randomized Shortest Paths Routing with Tsallis Divergence Regularization},
  year = {2020}
}
León G, Badı́a JM, Belloch JA, Lindoso A and Entrena L (2020), "Evaluating the soft error sensitivity of a GPU-based SoC for matrix multiplication", Microelectronics Reliability., 10, 2020. Elsevier BV.
Abstract: System-on-Chip (SoC) devices can be composed of low-power multicore processors combined with a small graphics accelerator (or GPU) which offers a trade-off between computational capacity and low-power consumption. In this work we use the LLFI-GPU fault injection tool on one of these devices to compare the sensitivity to soft errors of two different CUDA versions of the matrix multiplication benchmark. Specifically, we perform fault injection campaigns on a Jetson TK1 development kit, a board equipped with a SoC including an NVIDIA "Kepler" Graphics Processing Unit (GPU). We evaluate the effect of modifying the size of the problem and also the thread-block size on the behaviour of the algorithms. Our results show that the block version of the matrix multiplication benchmark that leverages the shared memory of the GPU is not only faster than the element-wise version, but it is also much more resilient to soft errors. We also use the cuda-gdb debugger to analyze the main causes of the crashes in the code due to soft errors. Our experiments show that most of the errors are due to accesses to invalid positions of the different memories of the GPU, with the block version suffering a higher percentage of this kind of error.
BibTeX:
@article{Leon2020,
  author = {Germán León and José M. Badı́a and Jose A. Belloch and Almudena Lindoso and Luis Entrena},
  title = {Evaluating the soft error sensitivity of a GPU-based SoC for matrix multiplication},
  journal = {Microelectronics Reliability},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.microrel.2020.113856}
}
Levy R, Solomonik E and Clark BK (2020), "Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions", July, 2020.
Abstract: The Density Matrix Renormalization Group (DMRG) algorithm is a powerful tool for solving eigenvalue problems to model quantum systems. DMRG relies on tensor contractions and dense linear algebra to compute properties of condensed matter physics systems. However, its efficient parallel implementation is challenging due to limited concurrency, large memory footprint, and tensor sparsity. We mitigate these problems by implementing two new parallel approaches that handle block sparsity arising in DMRG, via Cyclops, a distributed memory tensor contraction library. We benchmark their performance on two physical systems using the Blue Waters and Stampede2 supercomputers. Our DMRG performance is improved by up to 5.9X in runtime and 99X in processing rate over ITensor, at roughly comparable computational resource use. This enables higher accuracy calculations via larger tensors for quantum state approximation. We demonstrate that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms.
BibTeX:
@article{Levy2020,
  author = {Ryan Levy and Edgar Solomonik and Bryan K. Clark},
  title = {Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions},
  year = {2020}
}
Li J, Lakshminarasimhan M, Wu X, Li A, Olschanowsky C and Barker K (2020), "A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs", January, 2020.
Abstract: Tensor computations present significant performance challenges that impact a wide spectrum of applications ranging from machine learning, healthcare analytics, social network analysis, data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO) on CPUs and GPUs. It presents a set of reference tensor kernel implementations that are compatible with real-world tensors and power law tensors extended from synthetic graph generation techniques. We also propose Roofline performance models for these kernels to provide insights of computer platforms from sparse tensor view.
BibTeX:
@article{Li2020,
  author = {Jiajia Li and Mahesh Lakshminarasimhan and Xiaolong Wu and Ang Li and Catherine Olschanowsky and Kevin Barker},
  title = {A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs},
  year = {2020}
}
Li R and Zhang C (2020), "Efficient Parallel Implementations of Sparse Triangular Solves for GPU Architectures", In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing., 1, 2020. , pp. 106-117. Society for Industrial and Applied Mathematics.
Abstract: The sparse triangular matrix solve (SpTrSV) is an important computation kernel that is demanded by a variety of numerical methods such as the Gauss-Seidel iterations. However, developing efficient parallel algorithms for SpTrSV that are suitable for GPUs remains a challenging task due to the inherently sequential nature of the solve. In this paper, we revisit this problem by reviewing several parallel algorithms based on different task scheduling and different sparse matrix storage schemes, proposing modifications to the existing methods that can greatly improve the performance, and describing the implementations in detail. Numerical results of Gauss-Seidel iterations with structured and unstructured matrices make evident the superiority of the proposed algorithms and implementations compared with state-of-the-art methods in the literature.
BibTeX:
@incollection{Li2020a,
  author = {Ruipeng Li and Chaoyu Zhang},
  title = {Efficient Parallel Implementations of Sparse Triangular Solves for GPU Architectures},
  booktitle = {Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  pages = {106--117},
  doi = {10.1137/1.9781611976137.10}
}
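The sequential dependency structure mentioned above is often exposed with level scheduling: rows whose dependencies are already solved form a level and can be processed in parallel. A rough serial MATLAB sketch of that dependency analysis follows (purely illustrative, not the paper's GPU kernels).
% Level scheduling for a sparse unit lower-triangular solve L*x = b.
n = 1000;
L = speye(n) + tril(sprand(n, n, 2e-3), -1);       % unit lower-triangular test matrix
b = rand(n, 1);
level = zeros(n, 1);
for j = 1:n
    deps = find(L(j, 1:j-1));                      % columns row j depends on
    if isempty(deps), level(j) = 1; else, level(j) = max(level(deps)) + 1; end
end
x = zeros(n, 1);
for lev = 1:max(level)
    rows = find(level == lev);                     % rows independent within a level
    x(rows) = b(rows) - L(rows, :) * x;            % all earlier levels already solved
end
fprintf('number of levels: %d, residual: %.1e\n', max(level), norm(L * x - b));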
Li S, Li Q, Zhu Z, Tang G and Wakin MB (2020), "The Global Geometry of Centralized and Distributed Low-rank Matrix Recovery without Regularization", March, 2020.
Abstract: Low-rank matrix recovery is a fundamental problem in signal processing and machine learning. A recent very popular approach to recovering a low-rank matrix X is to factorize it as a product of two smaller matrices, i.e., X=UV^T, and then optimize over U, V instead of X. Despite the resulting non-convexity, recent results have shown that the factorized objective functions have benign global geometry---with no spurious local minima and satisfying the so-called strict saddle property---ensuring convergence to a global minimum for many local-search algorithms. However, most of these results actually consider a modified cost function that includes a balancing regularizer. While useful for deriving theory, this balancing regularizer does not appear to be necessary in practice. In this work, we close this theory-practice gap by proving that the original factorized non-convex problem, without the balancing regularizer, also has similar benign global geometry. Moreover, we also extend our theoretical results to the field of distributed optimization.
BibTeX:
@article{Li2020b,
  author = {Shuang Li and Qiuwei Li and Zhihui Zhu and Gongguo Tang and Michael B. Wakin},
  title = {The Global Geometry of Centralized and Distributed Low-rank Matrix Recovery without Regularization},
  year = {2020}
}
Li PH, Lee T and Youn HY (2020), "Dimensionality Reduction with Sparse Locality for Principal Component Analysis", Mathematical Problems in Engineering., 5, 2020. Vol. 2020, pp. 1-12. Hindawi Limited.
Abstract: Various dimensionality reduction (DR) schemes have been developed for projecting high-dimensional data into low-dimensional representation. The existing schemes usually preserve either only the global structure or local structure of the original data, but not both. To resolve this issue, a scheme called sparse locality for principal component analysis (SLPCA) is proposed. In order to effectively consider the trade-off between the complexity and efficiency, a robust L_2,p-norm-based principal component analysis (R2P-PCA) is introduced for global DR, while sparse representation-based locality preserving projection (SR-LPP) is used for local DR. Sparse representation is also employed to construct the weighted matrix of the samples. Being parameter-free, this allows the construction of an intrinsic graph that is more robust against noise. In addition, simultaneous learning of the projection matrix and the sparse similarity matrix is possible. Experimental results demonstrate that the proposed scheme consistently outperforms the existing schemes in terms of clustering accuracy and data reconstruction error.
BibTeX:
@article{Li2020c,
  author = {Pei Heng Li and Taeho Lee and Hee Yong Youn},
  title = {Dimensionality Reduction with Sparse Locality for Principal Component Analysis},
  journal = {Mathematical Problems in Engineering},
  publisher = {Hindawi Limited},
  year = {2020},
  volume = {2020},
  pages = {1--12},
  doi = {10.1155/2020/9723279}
}
Li R and Wang Z-Q (2020), "Restrictively Preconditioned Conjugate Gradient Method for a Series of Constantly Augmented Least Squares Problems", SIAM Journal on Matrix Analysis and Applications., 1, 2020. Vol. 41(2), pp. 838-851. Society for Industrial & Applied Mathematics (SIAM).
Abstract: In this study, we analyze the real-time solution of a series of augmented least squares problems, which are generated by adding information to an original least squares model repetitively. Instead of solving the least squares problems directly, we transform them into a batch of saddle point linear systems and subsequently solve the linear systems using restrictively preconditioned conjugate gradient (RPCG) methods. Approximation of the new Schur complement is generated effectively based on a previously approximated Schur complement. Owing to the variations of the preconditioned conjugate gradient method, the proposed methods generate convergence results similar to the conjugate gradient method and achieve a very fast convergent iterative sequence when the coefficient matrix is well preconditioned. Numerical tests show that the new methods are more effective than some standard Krylov subspace methods. Updated RPCG methods meet the requirement of real-time computing successfully for multifactor models.
BibTeX:
@article{Li2020d,
  author = {Rui Li and Zeng-Qi Wang},
  title = {Restrictively Preconditioned Conjugate Gradient Method for a Series of Constantly Augmented Least Squares Problems},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {41},
  number = {2},
  pages = {838--851},
  doi = {10.1137/19m1284853}
}
Li H, Fang C and Lin Z (2020), "Accelerated First-Order Optimization Algorithms for Machine Learning", Proceedings of the IEEE. , pp. 1-16. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Numerical optimization serves as one of the pillars of machine learning. To meet the demands of big data applications, lots of efforts have been put on designing theoretically and practically fast algorithms. This article provides a comprehensive survey on accelerated first-order algorithms with a focus on stochastic algorithms. Specifically, this article starts with reviewing the basic accelerated algorithms on deterministic convex optimization, then concentrates on their extensions to stochastic convex optimization, and at last introduces some recent developments on acceleration for nonconvex optimization.
BibTeX:
@article{Li2020e,
  author = {Huan Li and Cong Fang and Zhouchen Lin},
  title = {Accelerated First-Order Optimization Algorithms for Machine Learning},
  journal = {Proceedings of the IEEE},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--16},
  doi = {10.1109/jproc.2020.3007634}
}
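As a reminder of the kind of scheme this survey builds from, here is a bare-bones Nesterov accelerated gradient iteration on a least-squares toy problem (my own example; deterministic and convex only, none of the stochastic or nonconvex extensions the survey covers).
% Nesterov/FISTA-style accelerated gradient descent for min 0.5*||A*x - b||^2.
rng(1);
A = randn(200, 50); b = randn(200, 1);
f = @(x) 0.5 * norm(A * x - b)^2;
g = @(x) A' * (A * x - b);
Lip = norm(A)^2;                             % Lipschitz constant of the gradient
x = zeros(50, 1); y = x; t = 1;
for k = 1:300
    x_new = y - g(y) / Lip;                  % gradient step at the lookahead point
    t_new = (1 + sqrt(1 + 4 * t^2)) / 2;
    y = x_new + ((t - 1) / t_new) * (x_new - x);   % momentum extrapolation
    x = x_new; t = t_new;
end
fprintf('final objective %.3e, least-squares optimum %.3e\n', f(x), f(A \ b));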
Li C, Tang M, Tong R, Cai M, Zhao J and Manocha D (2020), "P-Cloth: Interactive Complex Cloth Simulation on Multi-GPU Systems using Dynamic Matrix Assembly and Pipelined Implicit Integrators", August, 2020.
Abstract: We present a novel parallel algorithm for cloth simulation that exploits multiple GPUs for fast computation and the handling of very high resolution meshes. To accelerate implicit integration, we describe new parallel algorithms for sparse matrix-vector multiplication (SpMV) and for dynamic matrix assembly on a multi-GPU workstation. Our algorithms use a novel work queue generation scheme for a fat-tree GPU interconnect topology. Furthermore, we present a novel collision handling scheme that uses spatial hashing for discrete and continuous collision detection along with a non-linear impact zone solver. Our parallel schemes can distribute the computation and storage overhead among multiple GPUs and enable us to perform almost interactive simulation on complex cloth meshes, which can hardly be handled on a single GPU due to memory limitations. We have evaluated the performance with two multi-GPU workstations (with 4 and 8 GPUs, respectively) on cloth meshes with 0.5-1.65M triangles. Our approach can reliably handle the collisions and generate vivid wrinkles and folds at 2-5 fps, which is significantly faster than prior cloth simulation systems. We observe almost linear speedups with respect to the number of GPUs.
BibTeX:
@article{Li2020f,
  author = {Cheng Li and Min Tang and Ruofeng Tong and Ming Cai and Jieyi Zhao and Dinesh Manocha},
  title = {P-Cloth: Interactive Complex Cloth Simulation on Multi-GPU Systems using Dynamic Matrix Assembly and Pipelined Implicit Integrators},
  year = {2020}
}
Li M, Xiao C and Yang C (2020), "a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs", October, 2020.
Abstract: Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensorial data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and core tensor, and are not flexible enough to adapt with the diversities of the input data and the hardware. Moreover, to exploit highly efficient GEMM kernels, most Tucker decomposition implementations make use of explicit matricizations, which could introduce extra costs in terms of data conversion and memory usage. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of dense tensors. A mode-wise flexible Tucker decomposition algorithm is proposed to enable the switch of different solvers for the factor matrices and core tensor, and a machine-learning adaptive solver selector is applied to automatically cope with the variations of both the input data and the hardware. To further improve the performance and enhance the memory efficiency, we implement a-Tucker in a fully matricization-free manner without any conversion between tensors and matrices. Experiments with a variety of synthetic and real-world tensors show that a-Tucker can substantially outperform existing works on both CPUs and GPUs.
BibTeX:
@article{Li2020g,
  author = {Min Li and Chuanfu Xiao and Chao Yang},
  title = {a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs},
  year = {2020}
}
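To make the matricization issue concrete, here is a tiny truncated-HOSVD sketch that forms explicit mode-n unfoldings of a dense 3-way tensor; these tensor-to-matrix conversions are exactly what a-Tucker is designed to avoid (the example is mine, not the paper's code).
% Truncated HOSVD factor matrices via explicit mode-n matricizations.
X = rand(20, 30, 40);                                   % small dense 3-way tensor
dims = size(X); r = [5 5 5];                            % target multilinear rank
U = cell(3, 1);
for mode = 1:3
    others = setdiff(1:3, mode);
    Xn = reshape(permute(X, [mode, others]), dims(mode), []);  % mode-n unfolding
    [Un, ~, ~] = svd(Xn, 'econ');
    U{mode} = Un(:, 1:r(mode));                         % truncated factor matrix
end
% The core tensor would follow by contracting X with U{1}', U{2}', U{3}'.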
Li J, Lakshminarasimhan M, Wu X, Li A, Olschanowsky C and Barker K (2020), "A Sparse Tensor Benchmark Suite for CPUs and GPUs", In Proceedings of the IEEE International Symposium on Workload Characterization., 10, 2020. IEEE.
Abstract: Tensor computations present significant performance challenges that impact a wide spectrum of applications ranging from machine learning, healthcare analytics, social network analysis, data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO) on CPUs and GPUs. It presents a set of reference tensor kernel implementations that are compatible with real-world tensors and power law tensors extended from synthetic graph generation techniques. We also propose Roofline performance models for these kernels to provide insights of computer platforms from sparse tensor view. This benchmark suite along with the synthetic tensor generator is publicly available.
BibTeX:
@inproceedings{Li2020h,
  author = {Jiajia Li and Mahesh Lakshminarasimhan and Xiaolong Wu and Ang Li and Catherine Olschanowsky and Kevin Barker},
  title = {A Sparse Tensor Benchmark Suite for CPUs and GPUs},
  booktitle = {Proceedings of the IEEE International Symposium on Workload Characterization},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/iiswc50251.2020.00027}
}
Li SE, Wang Z, Zheng Y, Sun Q, Gao J, Ma F and Li K (2020), "Synchronous and asynchronous parallel computation for large-scale optimal control of connected vehicles", Transportation Research Part C: Emerging Technologies., December, 2020. Vol. 121, pp. 102842. Elsevier BV.
Abstract: Connected vehicles is an important intelligent transportation system to improve the traffic performance. This paper proposes two parallel computation algorithms to solve a large-scale optimal control problem in the coordination of multiple connected vehicles. The coordination is formulated as a centralized optimization problem in the receding horizon fashion. A decentralized computation network is designed to facilitate the development of parallel algorithms. We use Taylor series to linearize non-convex constraints, and introduce a set of consensus constraints to transform the centralized problem to a standard consensus optimization problem. A synchronous parallel algorithm is firstly proposed to solve the consensus optimization problem by applying the alternating direction method of multipliers (ADMM). The ADMM framework allows us to decompose the coupling constraints and decision variables, leading to parallel iterations for each vehicle in a synchronous fashion. We then propose an asynchronous version of the parallel algorithm that allows the vehicles to update their variables asynchronously in the computation network. The effectiveness and efficiency of the proposed algorithms are validated by extensive numerical simulations.
BibTeX:
@article{Li2020i,
  author = {Shengbo Eben Li and Zhitao Wang and Yang Zheng and Qi Sun and Jiaxin Gao and Fei Ma and Keqiang Li},
  title = {Synchronous and asynchronous parallel computation for large-scale optimal control of connected vehicles},
  journal = {Transportation Research Part C: Emerging Technologies},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {121},
  pages = {102842},
  doi = {10.1016/j.trc.2020.102842}
}
Li H, Li Z, Li K, Rellermeyer JS, Chen LY and Li K (2020), "SGD_Tucker: A Novel Stochastic Optimization Strategy for Parallel Sparse Tucker Decomposition", December, 2020.
Abstract: Sparse Tucker Decomposition (STD) algorithms learn a core tensor and a group of factor matrices to obtain an optimal low-rank representation feature for the High-Order, High-Dimension, and Sparse Tensor (HOHDST). However, existing STD algorithms face the problem of intermediate-variable explosion, which results from the fact that forming those variables, i.e., the Khatri-Rao product, the Kronecker product, and matrix-matrix multiplications, involves all the elements of the sparse tensor. The above problems prevent deep fusion of efficient computation and big data platforms. To overcome the bottleneck, a novel stochastic optimization strategy (SGD_Tucker) is proposed for STD which can automatically divide the high-dimension intermediate variables into small batches of intermediate matrices. Specifically, SGD_Tucker only operates on randomly selected small samples rather than all elements, while maintaining the overall accuracy and convergence rate. In practice, SGD_Tucker features two distinct advancements over the state of the art. First, SGD_Tucker can prune the communication overhead for the core tensor in distributed settings. Second, the low data-dependence of SGD_Tucker enables fine-grained parallelization, which allows SGD_Tucker to obtain lower computational overheads with the same accuracy. Experimental results show that SGD_Tucker runs at least 2X faster than the state of the art.
BibTeX:
@article{Li2020j,
  author = {Hao Li and Zixuan Li and Kenli Li and Jan S. Rellermeyer and Lydia Y. Chen and Keqin Li},
  title = {SGD_Tucker: A Novel Stochastic Optimization Strategy for Parallel Sparse Tucker Decomposition},
  year = {2020}
}
Li L, Kameoka H and Makino S (2020), "Majorization-Minimization Algorithm for Discriminative Non-Negative Matrix Factorization", IEEE Access. Vol. 8, pp. 227399-227408. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: This paper proposes a basis training algorithm for discriminative non-negative matrix factorization (NMF) with applications to single-channel audio source separation. With an NMF-based approach to supervised audio source separation, NMF is first applied to train the basis spectra of each source using training examples and then applied to the spectrogram of a mixture signal using the pretrained basis spectra at test time. The source signals can then be separated out using a Wiener filter. Here, a typical way to train the basis spectra is to minimize the dissimilarity measure between the observed spectrogram and the NMF model. However, obtaining the basis spectra in this way does not ensure that the separated signal will be optimal at test time due to the inconsistency between the objective functions for training and separation (Wiener filtering). To address this mismatch, a framework called discriminative NMF (DNMF) has recently been proposed. While this framework is noteworthy in that it uses a common objective function for training and separation, the objective function becomes more analytically complex than that of regular NMF. In the original DNMF work, a multiplicative update algorithm was proposed for the basis training; however, the convergence of the algorithm is not guaranteed and can be very slow. To overcome this weakness, this paper proposes a convergence-guaranteed algorithm for DNMF based on a majorization-minimization principle. Experimental results show that the proposed algorithm outperforms the conventional DNMF algorithm as well as the regular NMF algorithm in terms of both the signal-to-distortion and signal-to-interference ratios.
BibTeX:
@article{Li2020k,
  author = {Li Li and Hirokazu Kameoka and Shoji Makino},
  title = {Majorization-Minimization Algorithm for Discriminative Non-Negative Matrix Factorization},
  journal = {IEEE Access},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  volume = {8},
  pages = {227399--227408},
  doi = {10.1109/access.2020.3045791}
}
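For context, the classical multiplicative updates for plain NMF with a Euclidean cost fit in a few lines of MATLAB; the paper's discriminative variant replaces the training objective and derives different MM updates, which are not reproduced here.
% Lee-Seung multiplicative updates for standard NMF, V ~ W*H (baseline only;
% the paper's DNMF training uses a different, discriminative objective).
rng('default');
V = abs(randn(64, 200));              % stand-in for a magnitude spectrogram
k = 8;
W = rand(size(V, 1), k); H = rand(k, size(V, 2));
for iter = 1:200
    H = H .* (W'*V) ./ max(W'*(W*H), eps);
    W = W .* (V*H') ./ max((W*H)*H', eps);
end
relativeError = norm(V - W*H, 'fro') / norm(V, 'fro');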
Liang L, Sun D and Toh K-C (2020), "An Inexact Augmented Lagrangian Method for Second-order Cone Programming with Applications", October, 2020.
Abstract: In this paper, we adopt the augmented Lagrangian method (ALM) to solve convex quadratic second-order cone programming problems (SOCPs). Fruitful results on the efficiency of the ALM have been established in the literature. Recently, it has been shown by Cui et al. (2019) that, if the quadratic growth condition holds at an optimal solution for the dual problem, then the KKT residual converges to zero R-superlinearly when the ALM is applied to the primal problem. Moreover, Cui et al (2017) provided sufficient conditions for the quadratic growth condition to hold under the metric subregularity and bounded linear regularity conditions for solving composite matrix optimization problems involving spectral functions. Here, we adopt these recent ideas to analyze the convergence properties of the ALM when applied to SOCPs. To the best of our knowledge, no similar work has been done for SOCPs so far. In our paper, we first provide sufficient conditions to ensure the quadratic growth condition for SOCPs. With these elegant theoretical guarantees, we then design an SOCP solver and apply it to solve various classes of SOCPs such as minimal enclosing ball problems, classic trust-region subproblems, square-root Lasso problems, and DIMACS Challenge problems. Numerical results show that the proposed ALM based solver is efficient and robust compared to the existing highly developed solvers such as Mosek and SDPT3.
BibTeX:
@article{Liang2020,
  author = {Ling Liang and Defeng Sun and Kim-Chuan Toh},
  title = {An Inexact Augmented Lagrangian Method for Second-order Cone Programming with Applications},
  year = {2020}
}
Liers F, Martin A, Merkert M, Mertens N and Michaels D (2020), "Towards the Solution of Mixed-Integer Nonlinear Optimization Problems using Simultaneous Convexification"
Abstract: Solving mixed-integer nonlinear optimization problems (MINLPs) to global optimality is extremely challenging. An important step for enabling their solution consists in the design of convex relaxations of the feasible set. Known solution approaches based on spatial branch-and-bound become more effective the tighter the used relaxations are. Relaxations are commonly established by convex underestimators, where each constraint function is considered separately. Instead, a considerably tighter relaxation can be found via so-called simultaneous convexification, where convex underestimators are derived for more than one constraint at a time. In this work, we present a global solution approach for solving mixed-integer nonlinear problems that uses simultaneous convexification. We introduce a separation method for the convex hull of constrained sets. It relies on determining the convex envelope of linear combinations of the constraints and on solving a nonsmooth convex problem. In particular, we apply the method to quadratic absolute value functions and derive their convex envelopes. The practicality of the proposed solution approach is demonstrated on several test instances from gas network optimization, where the method outperforms standard approaches that use separate convex relaxations.
BibTeX:
@article{Liers2020,
  author = {Frauke Liers and Alexander Martin and Maximilian Merkert and Nick Mertens and Dennis Michaels},
  title = {Towards the Solution of Mixed-Integer Nonlinear Optimization Problems using Simultaneous Convexification},
  year = {2020}
}
Lin T, Jin C and Jordan MI (2020), "Near-Optimal Algorithms for Minimax Optimization", February, 2020.
Abstract: This paper resolves a longstanding open question pertaining to the design of near-optimal first-order algorithms for smooth and strongly-convex-strongly-concave minimax problems. Current state-of-the-art first-order algorithms find an approximate Nash equilibrium using O(κ_x + κ_y) or O(min{κ_x √κ_y, √κ_x κ_y}) gradient evaluations, where κ_x and κ_y are the condition numbers for the strong-convexity and strong-concavity assumptions. A gap remains between these results and the best existing lower bound Ω(√(κ_x κ_y)). This paper presents the first algorithm with O(√(κ_x κ_y)) gradient complexity, matching the lower bound up to logarithmic factors. Our new algorithm is designed based on an accelerated proximal point method and an accelerated solver for minimax proximal steps. It can be easily extended to the settings of strongly-convex-concave, convex-concave, nonconvex-strongly-concave, and nonconvex-concave functions. This paper also presents algorithms that match or outperform all existing methods in these settings in terms of gradient complexity, up to logarithmic factors.
BibTeX:
@article{Lin2020,
  author = {Tianyi Lin and Chi Jin and Michael. I. Jordan},
  title = {Near-Optimal Algorithms for Minimax Optimization},
  year = {2020}
}
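As a point of reference for the condition numbers κ_x and κ_y in the abstract, here is plain simultaneous gradient descent-ascent on a strongly-convex-strongly-concave quadratic; this is the kind of baseline the paper's accelerated method improves upon, not the proposed algorithm itself, and the step size is chosen small enough for this toy to contract.
% Gradient descent-ascent on f(x,y) = 0.5*x'*Bx*x + x'*C*y - 0.5*y'*By*y,
% strongly convex in x and strongly concave in y (baseline, not the paper's method).
rng('default');
d = 5;
Bx = diag(linspace(1, 10, d));        % kappa_x = 10
By = diag(linspace(1, 10, d));        % kappa_y = 10
C  = 0.1*randn(d);
x = randn(d, 1); y = randn(d, 1);
eta = 5e-3;
for iter = 1:5000
    gx = Bx*x + C*y;                  % gradient with respect to x (descend)
    gy = C'*x - By*y;                 % gradient with respect to y (ascend)
    x = x - eta*gx;
    y = y + eta*gy;
end
residualNorm = norm([Bx*x + C*y; C'*x - By*y]);   % tends to zero at the saddle point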
Lin Z, Li H and Fang C (2020), "Accelerated Algorithms for Constrained Convex Optimization", In Accelerated Optimization for Machine Learning. , pp. 57-108. Springer Singapore.
Abstract: This chapter reviews the representative accelerated algorithms for deterministic constrained convex optimization. We overview the accelerated penalty method, accelerated Lagrange multiplier method, and the accelerated augmented Lagrange multiplier method. In particular, we concentrate on two widely used algorithms, namely the alternating direction method of multiplier (ADMM) and the primal-dual method. For ADMM, we study four scenarios, namely the generally convex and nonsmooth case, the strongly convex and nonsmooth case, the generally convex and smooth case, and the strongly convex and smooth case. We also introduce its non-ergodic accelerated variant. For the primal-dual method, we study three scenarios: both the two functions are generally convex, both are strongly convex, and one is generally convex, while the other is strongly convex. Finally, we introduce the Frank–Wolfe algorithm under the condition of strongly convex constraint set.
BibTeX:
@incollection{Lin2020a,
  author = {Zhouchen Lin and Huan Li and Cong Fang},
  title = {Accelerated Algorithms for Constrained Convex Optimization},
  booktitle = {Accelerated Optimization for Machine Learning},
  publisher = {Springer Singapore},
  year = {2020},
  pages = {57--108},
  doi = {10.1007/978-981-15-2910-8_3}
}
Lindquist N, Luszczek P and Dongarra J (2020), "Improving the Performance of the GMRES Method using Mixed-Precision Techniques", November, 2020.
Abstract: The GMRES method is used to solve sparse, non-symmetric systems of linear equations arising from many scientific applications. The solver performance within a single node is memory bound, due to the low arithmetic intensity of its computational kernels. To reduce the amount of data movement, and thus, to improve performance, we investigated the effect of using a mix of single and double precision while retaining double-precision accuracy. Previous efforts have explored reduced precision in the preconditioner, but the use of reduced precision in the solver itself has received limited attention. We found that GMRES only needs double precision in computing the residual and updating the approximate solution to achieve double-precision accuracy, although it must restart after each improvement of single-precision accuracy. This finding holds for the tested orthogonalization schemes: Modified Gram-Schmidt (MGS) and Classical Gram-Schmidt with Re-orthogonalization (CGSR). Furthermore, our mixed-precision GMRES, when restarted at least once, performed 19% and 24% faster on average than double-precision GMRES for MGS and CGSR, respectively. Our implementation uses generic programming techniques to ease the burden of coding implementations for different data types. Our use of the Kokkos library allowed us to exploit parallelism and optimize data management. Additionally, KokkosKernels was used when producing performance results. In conclusion, using a mix of single and double precision in GMRES can improve performance while retaining double-precision accuracy.
BibTeX:
@article{Lindquist2020,
  author = {Neil Lindquist and Piotr Luszczek and Jack Dongarra},
  title = {Improving the Performance of the GMRES Method using Mixed-Precision Techniques},
  year = {2020}
}
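The "double precision only for the residual and the solution update" finding is in the spirit of iterative refinement. A rough MATLAB sketch of that pattern follows; the inner correction is computed with a single-precision direct solve, standing in for the authors' restarted single-precision GMRES.
% Mixed-precision iterative refinement: residual and update in double,
% correction solve in single (a stand-in for single-precision restarted GMRES).
rng('default');
n = 500;
A = randn(n) + 10*eye(n);             % well-conditioned dense test matrix
b = randn(n, 1);
As = single(A);                       % single-precision copy of the matrix
x = zeros(n, 1);
for outer = 1:4
    r = b - A*x;                      % residual in double precision
    d = double(As \ single(r));       % correction computed in single precision
    x = x + d;                        % update in double precision
    fprintf('outer %d: relative residual %.2e\n', outer, norm(b - A*x)/norm(b));
end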
Liu H, Ren H, Gu H, Gao F and Yang G (2020), "UNAT: UNstructured Acceleration Toolkit on SW26010 many-core processor", Engineering Computations., 5, 2020. Vol. ahead-of-print(ahead-of-print) Emerald.
Abstract: The purpose of this paper is to provide an automatic parallelization toolkit for unstructured mesh-based computation. Among all kinds of mesh types, unstructured meshes are dominant in engineering simulation scenarios and play an essential role in scientific computations for their geometrical flexibility. However, the high-fidelity applications based on unstructured grids are still time-consuming, no matter for programming or running. This study develops an efficient UNstructured Acceleration Toolkit (UNAT), which provides friendly high-level programming interfaces and elaborates lower level implementation on the target hardware to get nearly hand-optimized performance. At the present state, two efficient strategies, a multi-level blocks method and a row-subsections method, are designed and implemented on Sunway architecture. Random memory access and write–write conflict issues of unstructured meshes have been handled by partitioning, coloring and other hardware-specific techniques. Moreover, a data-reuse mechanism is developed to increase the computational intensity and alleviate the memory bandwidth bottleneck. The authors select sparse matrix-vector multiplication as a performance benchmark of UNAT across different data layouts and different matrix formats. Experimental results show that the speed-ups reach up to 26× compared to single management processing element, and the utilization ratio tests indicate the capability of achieving nearly hand-optimized performance. Finally, the authors adopt UNAT to accelerate a well-tuned unstructured solver and obtain speed-ups of 19× and 10× on average for main kernels and overall solver, respectively. The authors design an unstructured mesh toolkit, UNAT, to link the hardware and numerical algorithm, and then, engineers can focus on the algorithms and solvers rather than the parallel implementation. For the many-core processor SW26010 of the fastest supercomputer in China, UNAT yields up to 26× speed-ups and achieves nearly hand-optimized performance.
BibTeX:
@article{Liu2020,
  author = {Hongbin Liu and Hu Ren and Hanfeng Gu and Fei Gao and Guangwen Yang},
  title = {UNAT: UNstructured Acceleration Toolkit on SW26010 many-core processor},
  journal = {Engineering Computations},
  publisher = {Emerald},
  year = {2020},
  volume = {ahead-of-print},
  number = {ahead-of-print},
  doi = {10.1108/ec-09-2019-0401}
}
Liu Y, Ghysels P, Claus L and Li XS (2020), "Sparse Approximate Multifrontal Factorization with Butterfly Compression for High Frequency Wave Equations", July, 2020.
Abstract: We present a fast and approximate multifrontal solver for large-scale sparse linear systems arising from finite-difference, finite-volume or finite-element discretization of high-frequency wave equations. The proposed solver leverages the butterfly algorithm and its hierarchical matrix extension for compressing and factorizing large frontal matrices via graph-distance guided entry evaluation or randomized matrix-vector multiplication-based schemes. Complexity analysis and numerical experiments demonstrate O(N log² N) computation and O(N) memory complexity when applied to an N × N sparse system arising from 3D high-frequency Helmholtz and Maxwell problems.
BibTeX:
@article{Liu2020a,
  author = {Yang Liu and Pieter Ghysels and Lisa Claus and Xiaoye Sherry Li},
  title = {Sparse Approximate Multifrontal Factorization with Butterfly Compression for High Frequency Wave Equations},
  year = {2020}
}
Liu J and Wang Z (2020), "A ROM-accelerated parallel-in-time preconditioner for solving all-at-once systems from evolutionary PDEs", December, 2020.
Abstract: In this paper we propose to use model reduction techniques for speeding up the diagonalization-based parallel-in-time (ParaDIAG) preconditioner, for iteratively solving all-at-once systems from evolutionary PDEs. In particular, we use the reduced basis method to seek a low-dimensional approximation to the sequence of complex-shifted systems arising from Step-(b) of the ParaDIAG preconditioning procedure. Different from the standard reduced order modeling that uses the separation of offline and online stages, we have to build the reduced order model (ROM) online for the considered systems at each iteration. Therefore, several heuristic acceleration techniques are introduced in the greedy basis generation algorithm, that is built upon a residual-based error indicator, to further boost up its computational efficiency. Several numerical experiments are conducted, which illustrate the favorable computational efficiency of our proposed ROM-accelerated ParaDIAG preconditioner, in comparison with the state of the art multigrid-based ParaDIAG preconditioner.
BibTeX:
@article{Liu2020b,
  author = {Jun Liu and Zhu Wang},
  title = {A ROM-accelerated parallel-in-time preconditioner for solving all-at-once systems from evolutionary PDEs},
  year = {2020}
}
Liu H, Duan S and Song W (2020), "Improved ADMM for Sparse Reconstruction of Bearing Vibration Signal", In Proceedings of Global Reliability and Prognostics and Health Management., October, 2020. IEEE.
Abstract: The original Alternating Direction Method of Multipliers (ADMM) algorithm is suited to convex problems and is not well suited to nonlinear vibration characteristics. In this paper, an improved ADMM algorithm is applied to the sparse reconstruction of big data sampled from rolling bearing vibration. On the basis of ADMM, ridge regression is applied to reduce the error of signal reconstruction. The improved ADMM uses a combinatorial optimization that stores an initial decomposition and makes subsequent iterations faster during the iteration process. Finally, the case of a rolling bearing vibration signal shows better reconstruction results.
BibTeX:
@inproceedings{Liu2020c,
  author = {He Liu and Shouwu Duan and Wanqing Song},
  title = {Improved ADMM for Sparse Reconstruction of Bearing Vibration Signal},
  booktitle = {Proceedings of Global Reliability and Prognostics and Health Management},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/phm-shanghai49105.2020.9280986}
}
Liu Y, Sid-Lakhdar WM, Marques O, Zhu X, Meng C, Demmel JW and Li XS (2020), "GPTune: Multitask Learning for Autotuning Exascale Applications", In Proceedings of the Conference of Principles and Practice of Parallel Programming.
Abstract: Multitask learning has proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We adapt this paradigm to develop autotuning frameworks, where the objective is to find the optimal performance parameters of an application code that is treated as a black-box function. Furthermore, we combine multitask learning with multi-objective tuning and incorporation of coarse performance models to enhance the tuning capability. The proposed framework is parallelized and applicable to any application, particularly exascale applications with a small number of function evaluations. Compared with other state-of-the-art single-task learning frameworks, the proposed framework attains up to 2.8X better code performance for at least 80% of all tasks.
BibTeX:
@inproceedings{Liu2020d,
  author = {Yang Liu and Wissam M. Sid-Lakhdar and Osni Marques and Xinran Zhu and Chang Meng and James W. Demmel and Xiaoye S. Li},
  title = {GPTune: Multitask Learning for Autotuning Exascale Applications},
  booktitle = {Proceedings of the Conference of Principles and Practice of Parallel Programming},
  year = {2020},
  doi = {10.1145/3437801.3441621}
}
Loe JA, Thornquist HK and Boman EG (2020), "Polynomial Preconditioned GMRES in Trilinos: Practical Considerations for High-Performance Computing", In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing., 1, 2020. , pp. 35-45. Society for Industrial and Applied Mathematics.
Abstract: Polynomial preconditioners for GMRES and other Krylov solvers are well-known but are infrequently used in large-scale software libraries or applications. This may be due to stability problems or complicated algorithms. We implement the GMRES polynomial as a preconditioner in the software library Trilinos and demonstrate that it is stable and effective for parallel computing. Trade-offs when selecting a polynomial degree and combining with other preconditioners are analyzed. We also discuss communication-avoiding (CA) properties of the polynomial and relate these to current CA-GMRES methods.
BibTeX:
@incollection{Loe2020,
  author = {Jennifer A. Loe and Heidi K. Thornquist and Erik G. Boman},
  title = {Polynomial Preconditioned GMRES in Trilinos: Practical Considerations for High-Performance Computing},
  booktitle = {Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing},
  publisher = {Society for Industrial and Applied Mathematics},
  year = {2020},
  pages = {35--45},
  doi = {10.1137/1.9781611976137.4}
}
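A minimal illustration of the mechanics, assuming a symmetric positive definite test matrix: a low-degree Neumann-series polynomial p(A) ≈ inv(A) is supplied to MATLAB's gmres through a function handle. The paper instead constructs the GMRES polynomial itself and runs inside Trilinos, so treat this only as a sketch of how a polynomial preconditioner is applied.
% Polynomial preconditioning for gmres via a function handle; p(A) is a
% truncated Neumann series, valid here because ||I - omega*A|| < 1 for SPD A
% when omega = 1/||A||.
rng('default');
A = gallery('poisson', 40) + 0.1*speye(1600);   % SPD sparse test matrix
b = randn(size(A, 1), 1);
omega  = 1 / normest(A);
degree = 8;
precFun = @(x) applyNeumannPoly(A, x, omega, degree);
[x, flag, relres, iters] = gmres(A, b, [], 1e-8, 200, precFun);

function y = applyNeumannPoly(A, x, omega, degree)
% Applies y = p(A)*x with p(A) = omega * sum_{k=0}^{degree} (I - omega*A)^k.
y = x;
for k = 1:degree
    y = x + y - omega*(A*y);
end
y = omega*y;
end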
Lopez F, Chow E, Tomov S and Dongarra J (2020), "Asynchronous SGD for DNN training on Shared-memory Parallel Architectures", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops., 5, 2020. IEEE.
Abstract: We present a parallel asynchronous Stochastic Gradient Descent algorithm for shared memory architectures. Different from previous asynchronous algorithms, we consider the case where the gradient updates are not particularly sparse. In the context of the MagmaDNN framework, we compare the parallel efficiency of the asynchronous implementation with that of the traditional synchronous implementation. Tests are performed for training deep neural networks on multicore CPUs and GPU devices.
BibTeX:
@inproceedings{Lopez2020,
  author = {Florent Lopez and Edmond Chow and Stanimire Tomov and Jack Dongarra},
  title = {Asynchronous SGD for DNN training on Shared-memory Parallel Architectures},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ipdpsw50202.2020.00168}
}
Lourenco C, Chen J, Moreno-Centeno E and Davis TA (2020), "User Guide for SLIP LU, A Sparse Left-Looking Integer-Preserving LU Factorization"
Abstract: SLIP LU is a software package designed to exactly solve unsymmetric sparse linear systems, Ax = b, where A ∊ ℚ^(n×n), b ∊ ℚ^(n×r), and x ∊ ℚ^(n×r). This package performs a left-looking, roundoff-error-free (REF) LU factorization PAQ = LDU, where L and U are integer, D is diagonal, and P and Q are row and column permutations, respectively. Note that the matrix D is never explicitly computed nor needed; thus this package uses only the matrices L and U. The theory associated with this code is the Sparse Left-looking Integer-Preserving (SLIP) LU factorization [8]. Aside from solving sparse linear systems exactly, one of the key goals of this package is to provide a framework for other solvers to benchmark the reliability and stability of their linear solvers, as our final solution vector x is guaranteed to be exact. In addition, SLIP LU provides a wrapper class for the GNU Multiple Precision Arithmetic (GMP) and GNU Multiple Precision Floating Point Reliable (MPFR) libraries in order to prevent memory leaks and improve the overall stability of these external libraries. SLIP LU is written in ANSI C and is accompanied by a MATLAB interface.
BibTeX:
@manual{Lourenco2020,
  author = {Christopher Lourenco and Jinhao Chen and Erick Moreno-Centeno and Timothy A. Davis},
  title = {User Guide for SLIP LU, A Sparse Left-Looking Integer-Preserving LU Factorization},
  year = {2020}
}
Lu Y, Yamazaki I, Ino F, Matsushita Y, Tomov S and Dongarra J (2020), "Reducing the Amount of Out-of-Core Data Access for GPU-Accelerated Randomized SVD"
Abstract: We propose two acceleration methods, namely Fused and Gram, for reducing out-of-core data access when performing randomized singular value decomposition (RSVD) on graphics processing units (GPUs). Out-of-core data here are data that are too large to fit into the GPU memory at once. Both methods accelerate GPU-enabled RSVD using the following three schemes: (1) a highly tuned general matrix-matrix multiplication (GEMM) scheme for processing out-of-core data on GPUs; (2) a data-access reduction scheme based on one-dimensional (1D) data partition; and (3) a first-in, first-out (FIFO) scheme that reduces CPU-GPU data transfer using the reverse iteration. The Fused method further reduces the amount of out-of-core data access by merging two GEMM operations into a single operation. In contrast, the Gram method reduces both in-core and out-of-core data access by explicitly forming the Gram matrix. According to our experimental results, the Fused and Gram methods improved the RSVD performance by up to 1.7× and 5.2×, respectively, compared with a straightforward method that deploys schemes (1) and (2) on the GPU. In addition, we present a case study of deploying the Gram method for accelerating robust principal component analysis (RPCA), a convex optimization problem in machine learning.
BibTeX:
@article{Lu2020,
  author = {Yuechao Lu and Ichitaro Yamazaki and Fumihiko Ino and Yasuyuki Matsushita and Stanimire Tomov and Jack Dongarra},
  title = {Reducing the Amount of Out-of-Core Data Access for GPU-Accelerated Randomized SVD},
  year = {2020}
}
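For readers unfamiliar with RSVD, the in-core textbook version is only a handful of matrix products; the paper's Fused and Gram variants reorganize exactly these products to cut out-of-core traffic on GPUs, which a small in-core example like this cannot show.
% Basic randomized SVD (range finding + small SVD); in-core toy version only.
rng('default');
A = randn(3000, 400) * diag(2.^-(0:399)) * randn(400, 400);   % rapidly decaying spectrum
k = 20; p = 10; q = 1;                 % target rank, oversampling, power iterations
Omega = randn(size(A, 2), k + p);
Y = A * Omega;
for i = 1:q
    Y = A * (A' * Y);                  % power iteration sharpens the captured range
end
[Q, ~] = qr(Y, 0);
B = Q' * A;                            % small (k+p)-by-n matrix
[Ub, S, V] = svd(B, 'econ');
U = Q * Ub;
U = U(:, 1:k); S = S(1:k, 1:k); V = V(:, 1:k);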
Lu Z, Niu Y and Liu W (2020), "Efficient Block Algorithms for Parallel Sparse Triangular Solve", In Proceedings of the 49th International Conference on Parallel Processing., 8, 2020. ACM.
Abstract: The sparse triangular solve (SpTRSV) kernel is an important building block for a number of linear algebra routines such as sparse direct and iterative solvers. The major challenge of accelerating SpTRSV lies in the difficulties of finding higher parallelism. Existing work mainly focuses on reducing dependencies and synchronizations in the level-set methods. However, the 2D block layout of the input matrix has been largely ignored in designing more efficient SpTRSV algorithms. In this paper, we implement three block algorithms, i.e., column block, row block and recursive block algorithms, for parallel SpTRSV on modern GPUs, and propose an adaptive approach that can automatically select the best kernels according to input sparsity structures. By testing 159 sparse matrices on two high-end NVIDIA GPUs, the experimental results demonstrate that the recursive block algorithm has the best performance among the three block algorithms, and it is on average 4.72× (up to 72.03×) and 9.95× (up to 61.08×) faster than cuSPARSE v2 and Sync-free methods, respectively. Besides, our method merely needs moderate cost for preprocessing the input matrix, thus is highly efficient for multiple right-hand sides and iterative scenarios.
BibTeX:
@inproceedings{Lu2020a,
  author = {Zhengyang Lu and Yuyao Niu and Weifeng Liu},
  title = {Efficient Block Algorithms for Parallel Sparse Triangular Solve},
  booktitle = {Proceedings of the 49th International Conference on Parallel Processing},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3404397.3404413}
}
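A serial MATLAB sketch of the column-block idea for a sparse lower-triangular solve: solve a diagonal block, then update the remaining right-hand side with the sub-diagonal block. The paper's contribution is doing this (and a recursive variant) efficiently in parallel on GPUs.
% Column-block forward substitution for L*x = b with sparse lower-triangular L.
rng('default');
n = 4000; blockSize = 500;
L = tril(sprandn(n, n, 0.002), -1) + speye(n);   % sparse lower triangular, unit diagonal
b = randn(n, 1); b0 = b; x = zeros(n, 1);
for jb = 1:blockSize:n
    cols = jb:min(jb + blockSize - 1, n);
    x(cols) = L(cols, cols) \ b(cols);           % triangular solve on the diagonal block
    rows = (cols(end) + 1):n;
    if ~isempty(rows)
        b(rows) = b(rows) - L(rows, cols)*x(cols);   % update the trailing right-hand side
    end
end
relres = norm(L*x - b0) / norm(b0);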
Lubbe R, Xu W-J, Wilke DN, Pizette P and Govender N (2020), "Analysis of parallel spatial partitioning algorithms for GPU based DEM", Computers and Geotechnics., 9, 2020. Vol. 125, pp. 103708. Elsevier BV.
Abstract: The capability of solving geotechnical discrete element method (DEM) applications is determined by the complexity of the simulation and its computational requirements. Collision detection algorithms are fundamental to resolving the mechanical collisions between millions of particles efficiently. These algorithms are a bottleneck for many DEM applications, resulting in excessive memory usage or poor computational performance. In particular, for GPU based DEM, there are many factors for a user to consider when deciding on an algorithm. This study discusses a set of diverse classes of geotechnical problems and the impact of algorithm choice. Four factors were considered: i) the world domain size, number of particles and particle density, ii) polydispersity in size, iii) the time evolution and iv) the particle shape. This study shows that for spherical particles, the choice of broad-phase collision detection algorithm has the most impact on computational performance. The computational cost for convex polyhedral particles is dominated by the selection of the particles' bounding volumes and their intersection tests over the selection of the broad-phase collision detection algorithm. On average for convex polyhedral particles, the broad-phase occupies at most 1.3% of the total runtime, while the narrow-phase collision detection and collision response require more than 87% of the runtime. A combination of bounding spheres and axis-aligned bounding boxes for use as bounding volumes of particles showed the best performance, reducing the computational cost by 20%. This study serves as a guide for further research in the field of GPU based DEM collision detection and the application in geotechnics.
BibTeX:
@article{Lubbe2020,
  author = {Retief Lubbe and Wen-Jie Xu and Daniel N. Wilke and Patrick Pizette and Nicolin Govender},
  title = {Analysis of parallel spatial partitioning algorithms for GPU based DEM},
  journal = {Computers and Geotechnics},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {125},
  pages = {103708},
  doi = {10.1016/j.compgeo.2020.103708}
}
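To make the broad-phase/narrow-phase distinction concrete, here is the simplest possible broad phase: an all-pairs axis-aligned bounding box (AABB) overlap test for spheres. Production DEM codes replace the all-pairs loop with sweep-and-prune, grids, or trees, which is where the algorithm choices studied in the paper come in.
% All-pairs AABB overlap test for spherical particles (broad phase only).
rng('default');
nP = 500;
centers = 10*rand(nP, 3);
radii   = 0.1 + 0.2*rand(nP, 1);
lo = centers - radii; hi = centers + radii;      % box corners per particle
pairs = zeros(0, 2);
for i = 1:nP-1
    for j = i+1:nP
        if all(lo(i, :) <= hi(j, :)) && all(lo(j, :) <= hi(i, :))
            pairs(end+1, :) = [i j];             %#ok<AGROW> candidate contact pair
        end
    end
end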
Luo X and Xu X (2020), "Regularized asymptotic descents for nonconvex optimization", April, 2020.
Abstract: In this paper we propose regularized asymptotic descent (RAD) methods for solving nonconvex optimization problems. Our motivation is first to apply the regularized iteration and then to use an explicit asymptotic formula to approximate the solution of each regularized minimization. We consider a class of possibly nonconvex, nonsmooth, or even discontinuous objectives extended from strongly convex functions with Lipschitz-continuous gradients, each of which has a unique global minimizer and is continuously differentiable at that minimizer. The main theoretical result shows that the RAD method enjoys global linear convergence with high probability for such a class of nonconvex objectives, i.e., the method will not be trapped in saddle points, local minima, or even discontinuities. Besides, the method is derivative-free and its per-iteration cost, i.e., the number of function evaluations, is bounded, so that it has a complexity bound O(log(1/ε)) for finding a point such that the optimality gap at this point is less than ε > 0.
BibTeX:
@article{Luo2020,
  author = {Xiaopeng Luo and Xin Xu},
  title = {Regularized asymptotic descents for nonconvex optimization},
  year = {2020}
}
Ma L, Ye J and Solomonik E (2020), "AutoHOOT: Automatic High-Order Optimization for Tensors", May, 2020.
Abstract: High-order optimization methods, including Newton's method and its variants as well as alternating minimization methods, dominate the optimization algorithms for tensor decompositions and tensor networks. These tensor methods are used for data analysis and simulation of quantum systems. In this work, we introduce AutoHOOT, the first automatic differentiation (AD) framework targeting at high-order optimization for tensor computations. AutoHOOT takes input tensor computation expressions and generates optimized derivative expressions. In particular, AutoHOOT contains a new explicit Jacobian / Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize. The expressions are then optimized by both the traditional compiler optimization techniques and specific tensor algebra transformations. Experimental results show that AutoHOOT achieves competitive performance for both tensor decomposition and tensor network applications compared to existing AD software and other tensor computation libraries with manually written kernels, both on CPU and GPU architectures. The scalability of the generated kernels is as good as other well-known high-order numerical algorithms so that it can be executed efficiently on distributed parallel systems.
BibTeX:
@article{Ma2020,
  author = {Linjian Ma and Jiayu Ye and Edgar Solomonik},
  title = {AutoHOOT: Automatic High-Order Optimization for Tensors},
  year = {2020}
}
Madani R, Kheirandishfard M, Lavaei J and Atamturk A (2020), "Penalized Semidefinite Programming for Quadratically Constrained Quadratic Optimization"
Abstract: In this paper, we give a new penalized semidefinite programming approach for non-convex quadratically-constrained quadratic programs (QCQPs). We incorporate penalty terms into the objective of convex relaxations in order to retrieve feasible and near-optimal solutions for non-convex QCQPs. We introduce a generalized linear independence constraint qualification (GLICQ) criterion and prove that any GLICQ regular point that is sufficiently close to the feasible set can be used to construct an appropriate penalty term and recover a feasible solution. As a consequence, we describe a heuristic sequential procedure that preserves feasibility and aims to improve the objective value at each iteration. Numerical experiments on large-scale system identification problems as well as benchmark instances from the library of quadratic programming (QPLIB) demonstrate the ability of the proposed penalized semidefinite programs in finding near-optimal solutions for non-convex QCQPs.
BibTeX:
@article{Madani2020,
  author = {Ramtin Madani and Mohsen Kheirandishfard and Javad Lavaei and Alper Atamturk},
  title = {Penalized Semidefinite Programming for Quadratically Constrained Quadratic Optimization},
  year = {2020},
  url = {https://www.ocf.berkeley.edu/~madani/paper/penalized_sdp.pdf}
}
Madsen JR, Awan MG, Brunie H, Deslippe J, Gayatri R, Oliker L, Wang Y, Yang C and Williams S (2020), "Timemory: Modular Performance Analysis for HPC", In Lecture Notes in Computer Science. , pp. 434-452. Springer International Publishing.
Abstract: HPC has undergone a significant transition toward heterogeneous architectures. This transition has introduced several issues in code migration to support multiple frameworks for targeting the various architectures. In order to cope with these challenges, projects such as Kokkos and LLVM create abstractions which map a generic front-end API to the backend that supports the targeted architecture. This paper presents a complementary framework for performance measurement and analysis. Several performance measurement and analysis tools in existence provide their capabilities through various methods, but the common theme among these tools is prohibitive limitations in terms of user-level extensions. For this reason, software developers commonly have to learn multiple tools, and valuable analysis methods, such as the roofline model, frequently have to be generated manually. The timemory framework provides complete modularity for performance measurement and analysis and eliminates all restrictions on user-level extensions. The timemory framework also provides a highly-efficient and intuitive method for handling multiple tools/measurements (i.e., "components") concurrently. The intersection of these characteristics provides ample evidence that timemory can serve as the common interface for existing performance measurement and analysis tools. Timemory components are developed in C++, but the framework includes multi-language support for C, Fortran, and Python codes. Numerous components are provided by the library itself – including, but not limited to, timers, memory usage, hardware counters, and FLOP and instruction roofline models. Additionally, analysis of the intrinsic overhead demonstrates superior performance in comparison with popular tools.
BibTeX:
@incollection{Madsen2020,
  author = {Jonathan R. Madsen and Muaaz G. Awan and Hugo Brunie and Jack Deslippe and Rahul Gayatri and Leonid Oliker and Yunsong Wang and Charlene Yang and Samuel Williams},
  title = {Timemory: Modular Performance Analysis for HPC},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {434--452},
  doi = {10.1007/978-3-030-50743-5_22}
}
Mai NHA, Magron V and Lasserre J-B (2020), "A sparse version of Reznick's Positivstellensatz", February, 2020.
Abstract: If f is a positive definite form, Reznick's Positivstellensatz [Mathematische Zeitschrift. 220 (1995), pp. 75--97] states that there exists k ∊ N such that ‖x‖_2^{2k} f is a sum of squares of polynomials. Assuming that f can be written as a sum of forms ∑_{l=1}^p f_l, where each f_l depends on a subset of the initial variables, and assuming that these subsets satisfy the so-called running intersection property, we provide a sparse version of Reznick's Positivstellensatz. Namely, there exists k ∊ N such that f = ∑_{l=1}^p σ_l / H_l^k, where σ_l is a sum of squares of polynomials, H_l is a uniform polynomial denominator, and both polynomials σ_l, H_l involve the same variables as f_l, for each l = 1, …, p. In other words, the sparsity pattern of f is also reflected in this sparse version of Reznick's certificate of positivity. We next use this result to also obtain positivity certificates for (i) polynomials nonnegative on the whole space and (ii) polynomials nonnegative on a (possibly non-compact) basic semialgebraic set, assuming that the input data satisfy the running intersection property. Both are sparse versions of a positivity certificate due to Putinar and Vasilescu.
BibTeX:
@article{Mai2020,
  author = {Ngoc Hoang Anh Mai and Victor Magron and Jean-Bernard Lasserre},
  title = {A sparse version of Reznick's Positivstellensatz},
  year = {2020}
}
Majumdar A, Hall G and Ahmadi AA (2020), "Recent Scalability Improvements for Semidefinite Programming with Applications in Machine Learning, Control, and Robotics", Annual Review of Control, Robotics, and Autonomous Systems. Vol. 3(1)
Abstract: Historically, scalability has been a major challenge for the successful application of semidefinite programming in fields such as machine learning, control, and robotics. In this article, we survey recent approaches to this challenge, including those that exploit structure (e.g., sparsity and symmetry) in a problem, those that produce low-rank approximate solutions to semidefinite programs, those that use more scalable algorithms that rely on augmented Lagrangian techniques and the alternating-direction method of multipliers, and those that trade off scalability with conservatism (e.g., by approximating semidefinite programs with linear and second-order cone programs). For each class of approaches, we provide a high-level exposition, an entry point to the corresponding literature, and examples drawn from machine learning, control, or robotics. We also present a list of software packages that implement many of the techniques discussed in the review. Our hope is that this article will serve as a gateway to the rich and exciting literature on scalable semidefinite programming for both theorists and practitioners.
BibTeX:
@article{Majumdar2020,
  author = {Majumdar, Anirudha and Hall, Georgina and Ahmadi, Amir Ali},
  title = {Recent Scalability Improvements for Semidefinite Programming with Applications in Machine Learning, Control, and Robotics},
  journal = {Annual Review of Control, Robotics, and Autonomous Systems},
  year = {2020},
  volume = {3},
  number = {1},
  doi = {10.1146/annurev-control-091819-074326}
}
Mangoubi O and Vishnoi NK (2020), "A Second-order Equilibrium in Nonconvex-Nonconcave Min-max Optimization: Existence and Algorithm", June, 2020.
Abstract: Min-max optimization, with a nonconvex-nonconcave objective function f: ℝ^d × ℝ^d → ℝ, arises in many areas, including optimization, economics, and deep learning. The nonconvexity-nonconcavity of f means that the problem of finding a global ε-min-max point cannot be solved in poly(d, 1/ε) evaluations of f. Thus, most algorithms seek to obtain a certain notion of local min-max point where, roughly speaking, each player optimizes her payoff in a local sense. However, the classes of local min-max solutions which prior algorithms seek are only guaranteed to exist under very strong assumptions on f, such as convexity or monotonicity. We propose a notion of a greedy equilibrium point for min-max optimization and prove the existence of such a point for any function such that it and its first three derivatives are bounded. Informally, we say that a point (x*, y*) is an ε-greedy min-max equilibrium point of a function f: ℝ^d × ℝ^d → ℝ if y* is a second-order local maximum for f(x*, ·) and, roughly, x* is a local minimum for a greedy optimization version of the function max_y f(x, y) which can be efficiently estimated using greedy algorithms. The existence follows from an algorithm that converges from any starting point to such a point in a number of gradient and function evaluations that is polynomial in 1/ε, the dimension d, and the bounds on f and its first three derivatives. Our results do not require convexity, monotonicity, or special starting points.
BibTeX:
@article{Mangoubi2020,
  author = {Oren Mangoubi and Nisheeth K. Vishnoi},
  title = {A Second-order Equilibrium in Nonconvex-Nonconcave Min-max Optimization: Existence and Algorithm},
  year = {2020}
}
Manguoğlu M, Polizzi E and Sameh AH (2020), "Parallel Hybrid Sparse Linear System Solvers", In Parallel Algorithms in Computational Science and Engineering. , pp. 95-120. Springer International Publishing.
Abstract: In this chapter, we present the SPIKE family of algorithms for solving banded linear systems and its multithreaded implementation as well as direct-iterative hybrid variants for solving general sparse linear system of equations.
BibTeX:
@incollection{Manguoglu2020,
  author = {Murat Manguoğlu and Eric Polizzi and Ahmed H. Sameh},
  title = {Parallel Hybrid Sparse Linear System Solvers},
  booktitle = {Parallel Algorithms in Computational Science and Engineering},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {95--120},
  doi = {10.1007/978-3-030-43736-7_4}
}
Marchal L, Marette T, Pichon G and Vivien F (2020), "Trading Performance for Memory in Sparse Direct Solvers using Low-rank Compression". Thesis at: Inria.
Abstract: Sparse direct solvers using Block Low-Rank compression have been proven efficient to solve problems arising in many real-life applications. Improving those solvers is crucial for being able to 1) solve larger problems and 2) speed up computations. A main characteristic of a sparse direct solver using low-rank compression is when compression is performed. There are two distinct approaches: (1) all blocks are compressed before starting the factorization, which reduces the memory as much as possible, or (2) each block is compressed as late as possible, which usually leads to better speedup. The objective of this paper is to design a composite approach, to speed up computations while staying under a given memory limit. This should make it possible to solve large problems that cannot be solved with Approach 2 while reducing the execution time compared to Approach 1. We propose a memory-aware strategy where each block can be compressed either at the beginning or as late as possible. We first consider the problem of choosing when to compress each block, under the assumption that all information on blocks is perfectly known, i.e., memory requirement and execution time of a block when compressed or not. We show that this problem is a variant of the NP-complete Knapsack problem, and adapt an existing 2-approximation algorithm for our problem. Unfortunately, the required information on blocks depends on numerical properties and in practice cannot be known in advance. We thus introduce models to estimate those values. Experiments on the PaStiX solver demonstrate that our new approach can achieve an excellent trade-off between memory consumption and computational cost. For instance on matrix Geo1438, Approach 2 uses three times as much memory as Approach 1 while being three times faster. Our new approach leads to an execution time only 30% larger than Approach 2 when given a memory 30% larger than the one needed by Approach 1.
BibTeX:
@techreport{Marchal2020,
  author = {Loris Marchal and Thibault Marette and Grégoire Pichon and Frédéric Vivien},
  title = {Trading Performance for Memory in Sparse Direct Solvers using Low-rank Compression},
  school = {Inria},
  year = {2020},
  url = {https://hal.inria.fr/hal-02976233/document}
}
Marin O, Constantinescu E and Smith B (2020), "A scalable matrix-free spectral element approach for unsteady PDE constrained optimization using PETSc/TAO", Journal of Computational Science., 9, 2020. , pp. 101207. Elsevier BV.
Abstract: We provide a new approach for the efficient matrix-free application of the transpose of the Jacobian for the spectral element method for the adjoint-based solution of partial differential equation (PDE) constrained optimization. This results in optimizations of nonlinear PDEs using explicit integrators where the integration of the adjoint problem is not more expensive than the forward simulation. Solving PDE constrained optimization problems entails combining expertise from multiple areas, including simulation, computation of derivatives, and optimization. The Portable, Extensible Toolkit for Scientific computation (PETSc) together with its companion package, the Toolkit for Advanced Optimization (TAO), is an integrated numerical software library that contains an algorithmic/software stack for solving linear systems, nonlinear systems, ordinary differential equations, differential algebraic equations, and large-scale optimization problems and, as such, is an ideal tool for performing PDE-constrained optimization. This paper describes an efficient approach in which the software stack provided by PETSc/TAO can be used for large-scale nonlinear time-dependent problems. Time integration can involve a range of high-order methods, both implicit and explicit. The PDE-constrained optimization algorithm used is gradient-based and seamlessly integrated with the simulation of the physical problem.
BibTeX:
@article{Marin2020,
  author = {Oana Marin and Emil Constantinescu and Barry Smith},
  title = {A scalable matrix-free spectral element approach for unsteady PDE constrained optimization using PETSc/TAO},
  journal = {Journal of Computational Science},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {101207},
  doi = {10.1016/j.jocs.2020.101207}
}
Martinsson P-G and Tropp J (2020), "Randomized Numerical Linear Algebra: Foundations & Algorithms", February, 2020.
Abstract: This survey describes probabilistic algorithms for linear algebra computations, such as factorizing matrices and solving linear systems. It focuses on techniques that have a proven track record for real-world problem instances. The paper treats both the theoretical foundations of the subject and the practical computational issues. Topics covered include norm estimation; matrix approximation by sampling; structured and unstructured random embeddings; linear regression problems; low-rank approximation; subspace iteration and Krylov methods; error estimation and adaptivity; interpolatory and CUR factorizations; Nyström approximation of positive-semidefinite matrices; single view ("streaming") algorithms; full rank-revealing factorizations; solvers for linear systems; and approximation of kernel matrices that arise in machine learning and in scientific computing.
BibTeX:
@article{Martinsson2020,
  author = {Per-Gunnar Martinsson and Joel Tropp},
  title = {Randomized Numerical Linear Algebra: Foundations & Algorithms},
  year = {2020}
}
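One of the simplest patterns the survey covers is sketch-and-solve least squares, shown below with a dense Gaussian embedding; structured embeddings (SRHT, sparse maps) are what make the sketching step fast in practice.
% Sketch-and-solve for an overdetermined least-squares problem min ||A*x - b||.
rng('default');
m = 20000; n = 200; s = 1000;          % s = sketched row dimension
A = randn(m, n); b = A*randn(n, 1) + 0.01*randn(m, 1);
S = randn(s, m) / sqrt(s);             % Gaussian sketching matrix
xSketch = (S*A) \ (S*b);               % solve the small sketched problem
xExact  = A \ b;
relativeError = norm(xSketch - xExact) / norm(xExact);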
Mei J, Xiao C, Szepesvari C and Schuurmans D (2020), "On the Global Convergence Rates of Softmax Policy Gradient Methods", May, 2020.
Abstract: We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a O(1/t) rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a &Lstrok;ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate O(e^-t) toward softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new (1/t) lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform &Lstrok;ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
BibTeX:
@article{Mei2020,
  author = {Jincheng Mei and Chenjun Xiao and Csaba Szepesvari and Dale Schuurmans},
  title = {On the Global Convergence Rates of Softmax Policy Gradient Methods},
  year = {2020}
}
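The tabular setting in the paper can be tried out on the smallest possible instance, a one-state bandit, where the exact softmax policy gradient of the expected reward r'π(θ) is (diag(π) − π π')r.
% Exact softmax policy gradient on a one-state bandit problem.
rng('default');
r = [1.0; 0.8; 0.2; 0.1];              % mean rewards, action 1 is optimal
theta = zeros(size(r));                % softmax logits
eta = 0.5;
for t = 1:2000
    p = exp(theta - max(theta)); p = p / sum(p);   % softmax policy
    grad = (diag(p) - p*p') * r;                   % exact policy gradient
    theta = theta + eta*grad;                      % gradient ascent on expected reward
end
% p now places most of its probability on the optimal action, consistent with
% the O(1/t) convergence of the reward gap shown in the paper.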
Melo JG, Monteiro RDC and Wang H (2020), "Iteration-complexity of an inexact proximal accelerated augmented Lagrangian method for solving linearly constrained smooth nonconvex composite optimization problems"
Abstract: This paper proposes and establishes the iteration-complexity of an inexact proximal accelerated augmented Lagrangian (IPAAL) method for solving linearly constrained smooth nonconvex composite optimization problems. Each IPAAL iteration consists of inexactly solving a proximal augmented Lagrangian subproblem by an accelerated composite gradient (ACG) method followed by a suitable Lagrange multiplier update. It is shown that IPAAL generates an approximate stationary solution in at most O(log(1/ρ)/ρ³) ACG iterations, where ρ > 0 is the given tolerance. It is also shown that the previous complexity bound can be sharpened to O(log(1/ρ)/ρ^2.5) under additional mildly stronger assumptions. The above bounds are derived without assuming that the initial point is feasible or that the domain of the composite term of the objective function is bounded. Some preliminary numerical results are presented to illustrate the performance of the IPAAL method.
BibTeX:
@article{Melo2020,
  author = {Melo, Jefferson G. and Monteiro, Renato D. C. and Wang, Hairong},
  title = {Iteration-complexity of an inexact proximal accelerated augmented Lagrangian method for solving linearly constrained smooth nonconvex composite optimization problems},
  year = {2020}
}
Mendler-Dünner C and Lucchi A (2020), "Randomized Block-Diagonal Preconditioning for Parallel Learning", June, 2020.
Abstract: We study preconditioned gradient-based optimization methods where the preconditioning matrix has block-diagonal form. Such a structural constraint comes with the advantage that the update computation can be parallelized across multiple independent tasks. Our main contribution is to demonstrate that the convergence of these methods can significantly be improved by a randomization technique which corresponds to repartitioning coordinates across tasks during the optimization procedure. We provide a theoretical analysis that accurately characterizes the expected convergence gains of repartitioning and validate our findings empirically on various traditional machine learning tasks. From an implementation perspective, block separable models are well suited for parallelization and, when shared memory is available, randomization can be implemented on top of existing methods very efficiently to improve convergence.
BibTeX:
@article{MendlerDuenner2020,
  author = {Celestine Mendler-Dünner and Aurelien Lucchi},
  title = {Randomized Block-Diagonal Preconditioning for Parallel Learning},
  year = {2020}
}
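A small sketch of the idea on a quadratic objective: gradient descent preconditioned by a block-diagonal piece of the Hessian, with the coordinate-to-block partition reshuffled at every iteration. The Hessian is given a large diagonal shift so the unit-step iteration contracts on this toy; the paper's analysis covers general machine-learning objectives, not this simplified setting.
% Block-diagonal preconditioned gradient descent with random repartitioning
% of coordinates across blocks at each iteration (toy quadratic problem).
rng('default');
n = 200; nBlocks = 4;
M = randn(n); H = M'*M/n + 10*eye(n);   % SPD Hessian of 0.5*x'*H*x - g'*x
g = randn(n, 1);
x = zeros(n, 1);
blockEdges = round(linspace(0, n, nBlocks + 1));
for iter = 1:50
    perm = randperm(n);                 % repartition coordinates across blocks
    grad = H*x - g;
    step = zeros(n, 1);
    for jb = 1:nBlocks
        idx = perm(blockEdges(jb) + 1:blockEdges(jb + 1));
        step(idx) = H(idx, idx) \ grad(idx);   % independent block solves
    end
    x = x - step;
end
relres = norm(H*x - g) / norm(g);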
Menon H, Bhatele A and Gamblin T (2020), "Auto-tuning Parameter Choices in HPC Applications using Bayesian Optimization", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium.
Abstract: High performance computing applications, runtimes, and platforms are becoming more configurable to enable applications to obtain better performance. As a result, users are increasingly presented with a multitude of options to configure application-specific as well as platform-level parameters. The combined effect of different parameter choices on application performance is difficult to predict, and an exhaustive evaluation of this combinatorial parameter space is practically infeasible. One approach to parameter selection is a user-guided exploration of a part of the space. However, such an ad hoc exploration of the parameter space can result in suboptimal choices. Therefore, an automatic approach that can efficiently explore the parameter space is needed. In this paper, we propose HiPerBOt, a Bayesian optimization based configuration selection framework to identify application and platform-level parameters that result in high performing configurations. We demonstrate the effectiveness of HiPerBOt in tuning parameters that include compiler flags, runtime settings, and application-level options for several parallel codes, including, Kripke, Hypre, LULESH, and OpenAtom.
BibTeX:
@inproceedings{Menon2020,
  author = {Harshitha Menon and Abhinav Bhatele and Todd Gamblin},
  title = {Auto-tuning Parameter Choices in HPC Applications using Bayesian Optimization},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium},
  year = {2020}
}
Mentus C and Roper M (2020), "Optimal Mixing in Transport Networks: Numerical Optimization and Analysis", July, 2020.
Abstract: Many foraging microorganisms rely upon cellular transport networks to deliver nutrients, fluid and organelles between different parts of the organism. Networked organisms ranging from filamentous fungi to slime molds demonstrate a remarkable ability to mix or disperse molecules and organelles in their transport media. Here we introduce mathematical tools to analyze the structure of energy efficient transport networks that maximize mixing and sending signals originating from and arriving at each node. We define two types of entropies on flows to quantify mixing and develop numerical algorithms to optimize the combination of entropy and energy on networks, given constraints on the amount of available material. We present an in-depth exploration of optimal single source-sink networks on finite triangular grids, a fundamental setting for optimal transport networks in the plane. Using numerical simulations and rigorous proofs, we show that, if the constraint on conductances is strict, the optimal networks are paths of every possible length. If the constraint is relaxed, our algorithm produces loopy networks that fan out at the source and pour back into a single path that flows to the sink. Taken together, our results expand the class of optimal transportation networks that can be compared with real biological data, and highlight how real network morphologies may be shaped by tradeoffs between transport efficiency and the need to mix the transported matter.
BibTeX:
@article{Mentus2020,
  author = {Cassidy Mentus and Marcus Roper},
  title = {Optimal Mixing in Transport Networks: Numerical Optimization and Analysis},
  year = {2020}
}
Miasnikof P, Hong S and Lawryshyn Y (2020), "Graph Clustering Via QUBO and Digital Annealing", March, 2020.
Abstract: This article empirically examines the computational cost of solving a known hard problem, graph clustering, using novel purpose-built computer hardware. We express the graph clustering problem as an intra-cluster distance or dissimilarity minimization problem. We formulate our problem as a quadratic unconstrained binary optimization problem and employ a novel computer architecture to obtain a numerical solution. Our starting point is a clustering formulation from the literature. This formulation is then converted to a quadratic unconstrained binary optimization formulation. Finally, we use a novel purpose-built computer architecture to obtain numerical solutions. For benchmarking purposes, we also compare computational performances to those obtained using a commercial solver, Gurobi, running on conventional hardware. Our initial results indicate the purpose-built hardware provides equivalent solutions to the commercial solver, but in a very small fraction of the time required.
BibTeX:
@article{Miasnikof2020,
  author = {Pierre Miasnikof and Seo Hong and Yuri Lawryshyn},
  title = {Graph Clustering Via QUBO and Digital Annealing},
  year = {2020}
}
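For intuition, a QUBO instance is just min over x in {0,1}^n of x'Qx; on a tiny instance it can be solved by enumeration, which is exactly what specialized hardware such as the digital annealer avoids at realistic sizes. The Q below is an arbitrary small symmetric matrix, not the paper's clustering formulation.
% Brute-force minimization of a small QUBO instance min x'*Q*x over binary x.
rng('default');
n = 12;
Q = randn(n); Q = (Q + Q')/2;          % arbitrary small symmetric QUBO matrix
bestVal = inf; bestX = [];
for k = 0:2^n - 1
    x = double(bitget(k, 1:n))';       % k-th binary assignment
    val = x'*Q*x;
    if val < bestVal
        bestVal = val; bestX = x;
    end
end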
Mills RT, Adams MF, Balay S, Brown J, Dener A, Knepley M, Kruger SE, Morgan H, Munson T, Rupp K, Smith BF, Zampini S, Zhang H and Zhang J (2020), "Toward Performance-Portable PETSc for GPU-based Exascale Systems", November, 2020.
Abstract: The Portable Extensible Toolkit for Scientific computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, and it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.
BibTeX:
@article{Mills2020,
  author = {Richard Tran Mills and Mark F. Adams and Satish Balay and Jed Brown and Alp Dener and Matthew Knepley and Scott E. Kruger and Hannah Morgan and Todd Munson and Karl Rupp and Barry F. Smith and Stefano Zampini and Hong Zhang and Junchao Zhang},
  title = {Toward Performance-Portable PETSc for GPU-based Exascale Systems},
  year = {2020}
}
Mishra A, Kirmani S and Madduri K (2020), "Fast Spectral Graph Layout on Multicore Platforms", In Proceedings of the 49th International Conference on Parallel Processing., August, 2020. ACM.
Abstract: We present ParHDE, a shared-memory parallelization of the High-Dimensional Embedding (HDE) graph algorithm. Originally proposed as a graph drawing algorithm, HDE characterizes the global structure of a graph and is closely related to spectral graph computations such as computing the eigenvectors of the graph Laplacian. We identify compute- and memory-intensive steps in HDE and parallelize these steps for efficient execution on shared-memory multicore platforms. ParHDE can process graphs with billions of edges in minutes, is up to 18× faster than a prior parallel implementation of HDE, and achieves up to a 24× relative speedup on a 28-core system. We also implement several extensions of ParHDE and demonstrate its utility in diverse graph computation-related applications.
BibTeX:
@inproceedings{Mishra2020,
  author = {Ashirbad Mishra and Shad Kirmani and Kamesh Madduri},
  title = {Fast Spectral Graph Layout on Multicore Platforms},
  booktitle = {Proceedings of the 49th International Conference on Parallel Processing},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3404397.3404471}
}
Mo T and Li R (2020), "Iteratively solving sparse linear system based on PaRSEC task scheduling", The International Journal of High Performance Computing Applications., January, 2020. , pp. 109434201989999. SAGE Publications.
Abstract: With the new architecture and new programming paradigms such as task-based scheduling emerging in the parallel high performance computing area, it is of great importance to utilize these features to tune the monolithic computing codes. In this article, the classical conjugate gradient algorithms targeting the sparse linear system Ax = b in a Krylov subspace are pipelined to execute interdependent tasks on the Parallel Runtime Scheduling and Execution Controller (PaRSEC) runtime. Firstly, the sparse matrix A is split in rows to unfold more coarse-grained parallelism. Secondly, the partitioned sub-vectors are not assembled into one full vector in RAM to run sparse matrix-vector product (SpMV) operations, which eliminates the communication overhead. Moreover, in the SpMV computation, if all elements of one column in the split sub-matrix are zeros, the corresponding product operations of these elements may be removed by reorganizing sub-vectors. Finally, the latency of migrating sub-vectors is partially overlapped by the duration of performing SpMV operations through the further splitting in columns of the sparse matrix on GPUs. In experiments, a series of tests demonstrate that optimal speedup and higher pipelining efficiency have been achieved for the pipelined task scheduling on the PaRSEC runtime. Fusing SpMV concurrency and dot product pipelining can achieve higher speedup and efficiency.
BibTeX:
@article{Mo2020,
  author = {Tieqiang Mo and Renfa Li},
  title = {Iteratively solving sparse linear system based on PaRSEC task scheduling},
  journal = {The International Journal of High Performance Computing Applications},
  publisher = {SAGE Publications},
  year = {2020},
  pages = {109434201989999},
  doi = {10.1177/1094342019899997}
}
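For orientation, the kernels being pipelined are just those of textbook CG: one SpMV, two dot products, and a few AXPYs per iteration. The sketch below is plain sequential MATLAB with those kernels written out explicitly; there is no partitioning and no task scheduling, so it only shows the dependency structure the paper maps onto PaRSEC.
% Textbook conjugate gradient with the SpMV and dot-product kernels visible as
% separate steps; these are the interdependent tasks that the paper pipelines.
A = gallery('poisson', 32);                      % SPD sparse test matrix (1024 x 1024)
b = ones(size(A, 1), 1);
x = zeros(size(b)); r = b - A*x; p = r;
rho = r' * r;                                    % dot product
for k = 1:200
    q = A * p;                                   % SpMV
    alpha = rho / (p' * q);                      % dot product
    x = x + alpha * p;                           % AXPY
    r = r - alpha * q;                           % AXPY
    rhoNew = r' * r;                             % dot product
    if sqrt(rhoNew) < 1e-10 * norm(b), break; end
    p = r + (rhoNew / rho) * p;                  % AXPY
    rho = rhoNew;
end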
Möller M and Schalkers M (2020), "|Lib> : A Cross-Platform Programming Framework for Quantum-Accelerated Scientific Computing", In Lecture Notes in Computer Science. , pp. 451-464. Springer International Publishing.
Abstract: This paper introduces a new cross-platform programming framework for developing quantum-accelerated scientific computing applications and executing them on most of today's cloud-based quantum computers and simulators. It makes use of C++ template meta-programming techniques to implement quantum algorithms as generic, platform-independent expressions, which get automatically synthesized into device-specific compute kernels upon execution. Our software framework supports concurrent and asynchronous execution of multiple quantum kernels via a CUDA-inspired stream concept.
BibTeX:
@incollection{Moeller2020,
  author = {Matthias Möller and Merel Schalkers},
  title = {|Lib> : A Cross-Platform Programming Framework for Quantum-Accelerated Scientific Computing},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {451--464},
  doi = {10.1007/978-3-030-50433-5_35}
}
Mofrad MH (2020), "Distributed Sparse Computing and Communication for Big Graph Analytics and Deep Learning". Thesis at: University of Pittsburgh.
Abstract: Sparsity can be found in the underlying structure of many real-world computationally expensive problems including big graph analytics and large scale sparse deep neural networks. In addition, if gracefully investigated, many of these problems contain a broad substratum of parallelism suitable for parallel and distributed executions of sparse computation. However, usually, dense computation is preferred to its sparse alternative as sparse computation is not only hard to parallelize due to the irregular nature of the sparse data, but also complicated to implement in terms of rewriting a dense algorithm into a sparse one. Hence, foolproof sparse computation requires customized data structures to encode the sparsity of the sparse data and new algorithms to mask the complexity of the sparse computation. However, by carefully exploiting the sparse data structures and algorithms, sparse computation can reduce memory consumption, communication volume, and processing power and thus undoubtedly move the scalability boundaries compared to its dense equivalent. In this dissertation, I explain how to use parallel and distributed computing techniques in the presence of sparsity to solve large scientific problems including graph analytics and deep learning. To meet this end goal, I leverage the duality between graph theory and sparse linear algebra primitives, and thus solve graph analytics and deep learning problems with sparse matrix operations. My contributions are fourfold: (1) design and implementation of a new distributed compressed sparse matrix data structure that reduces both computation and communication volumes and is suitable for sparse matrix-vector and sparse matrix-matrix operations, (2) introducing the new MPI + X parallelism model that deems threads as basic units of computing and communication, (3) optimizing sparse matrix-matrix multiplication by employing different hashing techniques, and (4) proposing the new data-then-model parallelism that mitigates the effect of stragglers in sparse deep learning by combining data and model parallelisms. Altogether, these contributions provide a set of data structures and algorithms to accelerate and scale sparse computing and communication.
BibTeX:
@phdthesis{Mofrad2020,
  author = {Mohammad Hasanzadeh Mofrad},
  title = {Distributed Sparse Computing and Communication for Big Graph Analytics and Deep Learning},
  school = {University of Pittsburgh},
  year = {2020},
  url = {http://d-scholarship.pitt.edu/39841/1/Thesis.pdf}
}
Mohammed T, Albeshri A, Katib I and Mehmood R (2020), "DIESEL: A novel deep learning-based tool for SpMV computations and solving sparse linear equation systems", The Journal of Supercomputing., November, 2020. Springer Science and Business Media LLC.
Abstract: Sparse linear algebra is central to many areas of engineering, science, and business. The community has done considerable work on proposing new methods for sparse matrix-vector multiplication (SpMV) computations and iterative sparse solvers on graphical processing units (GPUs). Due to vast variations in matrix features, no single method performs well across all sparse matrices. A few tools on automatic prediction of best-performing SpMV kernels have emerged recently and require many more efforts to fully utilize their potential. The utilization of a GPU by the existing SpMV kernels is far from its full capacity. Moreover, the development and performance analysis of SpMV techniques on GPUs have not been studied in sufficient depth. This paper proposes DIESEL, a deep learning-based tool that predicts and executes the best performing SpMV kernel for a given matrix using a feature set carefully devised by us through rigorous empirical and mathematical instruments. The dataset comprises 1056 matrices from 26 different real-life application domains including computational fluid dynamics, materials, electromagnetics, economics, and more. We propose a range of new metrics and methods for performance analysis, visualization, and comparison of SpMV tools. DIESEL provides better performance with its accuracy 88.2%, workload accuracy 91.96%, and average relative loss 4.4%, compared to 85.9%, 85.31%, and 7.65% by the next best performing artificial intelligence (AI)-based SpMV tool. The extensive results and analyses presented in this paper provide several key insights into the performance of the SpMV tools and how these relate to the matrix datasets and the performance metrics, allowing the community to further improve and compare basic and AI-based SpMV tools in the future.
BibTeX:
@article{Mohammed2020,
  author = {Thaha Mohammed and Aiiad Albeshri and Iyad Katib and Rashid Mehmood},
  title = {DIESEL: A novel deep learning-based tool for SpMV computations and solving sparse linear equation systems},
  journal = {The Journal of Supercomputing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s11227-020-03489-3}
}
Mohanamuraly P and Staffelbach G (2020), "Hardware Locality-Aware Partitioning and Dynamic Load-Balancing of Unstructured Meshes for Large-Scale Scientific Applications", In Proceedings of the Platform for Advanced Scientific Computing Conference., June, 2020. ACM.
Abstract: We present an open-source topology-aware hierarchical unstructured mesh partitioning and load-balancing tool TreePart. The framework provides powerful abstractions to automatically detect and build hierarchical MPI topology resembling the hardware at runtime. Using this information it intelligently chooses between shared and distributed parallel algorithms for partitioning and load-balancing. It provides a range of partitioning methods by interfacing with existing shared and distributed memory parallel partitioning libraries. It provides powerful and scalable abstractions like one-sided distributed dictionaries and MPI-3 shared memory based halo communicators for optimising HPC codes. The tool was successfully integrated into our in-house code and we present results from a large-eddy simulation of a combustion problem.
BibTeX:
@inproceedings{Mohanamuraly2020,
  author = {Pavanakumar Mohanamuraly and Gabriel Staffelbach},
  title = {Hardware Locality-Aware Partitioning and Dynamic Load-Balancing of Unstructured Meshes for Large-Scale Scientific Applications},
  booktitle = {Proceedings of the Platform for Advanced Scientific Computing Conference},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3394277.3401851}
}
Montoison A and Orban D (2020), "TriCG and TriMR: Two Iterative Methods for Symmetric Quasi-Definite Systems", August, 2020.
Abstract: We introduce iterative methods named TriCG and TriMR for solving symmetric quasi-definite systems based on the orthogonal tridiagonalization process proposed by Saunders, Simon and Yip in 1988. TriCG and TriMR are tantamount to preconditioned block-CG and block-MINRES with two right-hand sides in which the two approximate solutions are summed at each iteration, but require less storage and work per iteration. We evaluate the performance of TriCG and TriMR on linear systems generated from the SuiteSparse Matrix Collection and from discretized and stabilized Stokes equations. We compare TriCG and TriMR with SYMMLQ and MINRES, the recommended Krylov methods for symmetric and indefinite systems. In all our experiments, TriCG and TriMR terminate earlier than SYMMLQ and MINRES on a residual-based stopping condition with an improvement of up to 50% in terms of number of iterations.
BibTeX:
@article{Montoison2020,
  author = {Alexis Montoison and Dominique Orban},
  title = {TriCG and TriMR: Two Iterative Methods for Symmetric Quasi-Definite Systems},
  year = {2020},
  doi = {10.13140/RG.2.2.12344.16645}
}
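For readers unfamiliar with the problem class, a symmetric quasi-definite system has the 2x2 block structure sketched below. TriCG and TriMR are not available in stock MATLAB (the authors distribute them in Julia's Krylov.jl), so this sketch only assembles a small SQD system and solves it with MINRES, one of the baselines named in the abstract.
% Assemble a small symmetric quasi-definite (SQD) system
%   [ M   A ] [x]   [f]
%   [ A' -N ] [y] = [g],  with M and N symmetric positive definite,
% and solve it with MINRES as a reference (TriCG/TriMR themselves are not in MATLAB).
rng('default');
m = 200; n = 150;
M = sprandsym(m, 0.02, 0.1, 1);                  % SPD, reciprocal condition number 0.1
N = sprandsym(n, 0.02, 0.1, 1);                  % SPD
A = sprandn(m, n, 0.02);
K = [M, A; A', -N];                              % symmetric, indefinite, quasi-definite
rhs = randn(m + n, 1);
[z, flag, relres, iter] = minres(K, rhs, 1e-10, 500);
fprintf('MINRES: flag = %d, relres = %.2e, iterations = %d\n', flag, relres, iter);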
Mor U and Avron H (2020), "Solving Trust Region Subproblems Using Riemannian Optimization", October, 2020.
Abstract: The Trust Region Subproblem is a fundamental optimization problem that takes a pivotal role in Trust Region Methods. However, the problem, and variants of it, also arise in quite a few other applications. In this article, we present a family of globally convergent iterative Riemannian optimization algorithms for a variant of the Trust Region Subproblem that replaces the inequality constraint with an equality constraint. Our approach uses either a trivial or a non-trivial Riemannian geometry of the search-space, and requires only minimal spectral information about the quadratic component of the objective function. We further show how the theory of Riemannian optimization promotes a deeper understanding of the Trust Region Subproblem and its difficulties, e.g. a deep connection between the Trust Region Subproblem and the problem of finding affine eigenvectors, and a new examination of the so-called hard case in light of the condition number of the Riemannian Hessian operator at a global optimum. Finally, we propose to incorporate preconditioning via a careful selection of a variable Riemannian metric, and establish bounds on the asymptotic convergence rate in terms of how well the preconditioner approximates the input matrix.
BibTeX:
@article{Mor2020,
  author = {Uria Mor and Haim Avron},
  title = {Solving Trust Region Subproblems Using Riemannian Optimization},
  year = {2020}
}
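Since the variant studied replaces the inequality constraint with an equality constraint, the feasible set is a sphere, which is what makes Riemannian optimization applicable. Below is a deliberately naive Riemannian gradient iteration on that sphere, just to illustrate the geometry; it is not one of the authors' algorithms and uses a crude fixed step size.
% Naive Riemannian gradient descent for  min 0.5*x'*A*x + b'*x  s.t.  norm(x) = Delta.
rng('default');
n = 100;
A = randn(n); A = (A + A') / 2;                  % symmetric quadratic term
b = randn(n, 1);
Delta = 1;
x = randn(n, 1); x = Delta * x / norm(x);        % start on the sphere
t = 1 / (norm(A, 1) + 1);                        % crude fixed step size
for k = 1:1000
    g  = A*x + b;                                % Euclidean gradient
    rg = g - ((x'*g) / Delta^2) * x;             % project onto the tangent space of the sphere
    if norm(rg) < 1e-8, break; end
    x = x - t*rg;                                % step in the embedding space
    x = Delta * x / norm(x);                     % retract back onto the sphere
end
fval = 0.5*x'*A*x + b'*x;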
Moríñigo JA, García-Muller P, Rubio-Montero AJ, Gómez-Iglesias A, Meyer N and Mayo-García R (2020), "Performance drop at executing communication-intensive parallel algorithms", The Journal of Supercomputing., January, 2020. Springer Science and Business Media LLC.
Abstract: This work summarizes the results of a set of executions completed on three fat-tree network supercomputers: Stampede at TACC (USA), Helios at IFERC (Japan) and Eagle at PSNC (Poland). Three MPI-based, communication-intensive scientific applications compiled for CPUs have been executed under weak-scaling tests: the molecular dynamics solver LAMMPS; the finite element-based mini-kernel miniFE of NERSC (USA); and the three-dimensional fast Fourier transform mini-kernel bigFFT of LLNL (USA). The design of the experiments focuses on the sensitivity of the applications to rather different patterns of task location, to assess the impact on the cluster performance. The accomplished weak-scaling tests stress the effect of the MPI-based application mappings (concentrated vs. distributed patterns of MPI tasks over the nodes) on the cluster. Results reveal that highly distributed task patterns may imply a much larger execution time in scale, when several hundreds or thousands of MPI tasks are involved in the experiments. Such a characterization serves users to carry out further, more efficient executions. Also researchers may use these experiments to improve their scalability simulators. In addition, these results are useful from the clusters administration standpoint since tasks mapping has an impact on the cluster throughput.
BibTeX:
@article{Morinigo2020,
  author = {José A. Moríñigo and Pablo García-Muller and Antonio J. Rubio-Montero and Antonio Gómez-Iglesias and Norbert Meyer and Rafael Mayo-García},
  title = {Performance drop at executing communication-intensive parallel algorithms},
  journal = {The Journal of Supercomputing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s11227-019-03142-8}
}
Moutafis BE, Gravvanis GA and Filelis-Papadopoulos CK (2020), "Hybrid multi-projection method using sparse approximate inverses on GPU clusters", The International Journal of High Performance Computing Applications., February, 2020. , pp. 109434202090563. SAGE Publications.
Abstract: The state-of-the-art supercomputing infrastructures are equipped with accelerators, such as graphics processing units (GPUs), that operate as coprocessors for each workstation of the distributed memory system. The multi-projection type methods are a class of algebraic domain decomposition methods based on semi-aggregation techniques. The multi-projection type methods have improved convergence behavior, as the number of subdomains increases, due to the corresponding augmentation of the semi-aggregated local linear systems with more coarse components, while the number of fine components is reduced. Moreover, limited amount of communications among the workstations is required by the proposed method. The utilization of the available GPUs allows an increase in the number of subdomains along with finer-grained parallelism, leading to improved performance. A load-balancing algorithm that ensures the concurrency of the computations on multicore processors and GPUs is proposed. Flexible parallel preconditioned Krylov subspace iterative methods enhanced with multi-projection type methods have been designed appropriately in order to have improved performance, compared to CPU-only or GPU-only executions, by exploiting the available CPUs and GPUs of the distributed memory system concurrently. The unsymmetric local linear systems are solved by the preconditioned Bi-Conjugate Gradient STABilized (BiCGSTAB) method enhanced with the modified generic factored approximate sparse inverse preconditioner, whereas the preconditioned conjugate gradient (CG) method along with the symmetric factored approximate sparse inverse preconditioner is used for the symmetric positive definite local coefficient matrices. Numerical results regarding the convergence behavior, the performance, and the scalability of the proposed method for several problems are given.
BibTeX:
@article{Moutafis2020,
  author = {Byron E Moutafis and George A Gravvanis and Christos K Filelis-Papadopoulos},
  title = {Hybrid multi-projection method using sparse approximate inverses on GPU clusters},
  journal = {The International Journal of High Performance Computing Applications},
  publisher = {SAGE Publications},
  year = {2020},
  pages = {109434202090563},
  doi = {10.1177/1094342020905637}
}
Moutafis BE, Gravvanis GA and Filelis-Papadopoulos CK (2020), "On the design of two-stage multiprojection methods for distributed memory systems", The Journal of Supercomputing., February, 2020. Springer Science and Business Media LLC.
Abstract: Solving large sparse linear systems, efficiently, on supercomputing infrastructures is a time-consuming component for a wide variety of simulation processes. An effective parallel solver should meet the required specifications, concerning both convergence behavior and scalability. Herewith, a class of two-stage algebraic domain decomposition preconditioning schemes based on the upper Schur complement method is proposed, in order to exploit appropriately distributed memory systems with multicore processors. The design of the method has been focused on homogeneous hybrid parallel systems, i.e., distributed and shared memory systems. However, the proposed method can also be applied to heterogeneous systems, such as cloud infrastructures, or hybrid parallel systems with accelerators, by modifying the workload distribution algorithm and taking into account the different network latencies and bandwidths. The first stage of the proposed schemes is related to the assignment of the subdomains among the workstations of the distributed system, whereas the second stage concerns the further redistribution of the subdomains to each core of a processor. The proposed method utilizes multiprojection techniques, based on semi-aggregated subdomains, leading to improved convergence behavior as the number of subdomains increases. Moreover, a subspace compression technique is used, in order to improve the performance of the preprocessing phase and reduce the memory requirements of the proposed scheme. The preconditioning schemes were combined with a parallel Krylov subspace method, i.e., the parallel preconditioned GMRES(m) method. The convergence behavior, the performance and the scalability of the proposed preconditioning schemes are examined and compared to existing state-of-the-art methods, by conducting several numerical experiments on supercomputing infrastructures.
BibTeX:
@article{Moutafis2020a,
  author = {B. E. Moutafis and G. A. Gravvanis and C. K. Filelis-Papadopoulos},
  title = {On the design of two-stage multiprojection methods for distributed memory systems},
  journal = {The Journal of Supercomputing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s11227-020-03201-5}
}
Moutafis P, García-García F, Mavrommatis G, Vassilakopoulos M, Corral A and Iribarne L (2020), "Algorithms for processing the group K nearest-neighbor query on distributed frameworks", Distributed and Parallel Databases., November, 2020. Springer Science and Business Media LLC.
Abstract: Given two datasets of points (called Query and Training), the Group K Nearest-Neighbor (GKNN) query retrieves K points of the Training with the smallest sum of distances to every point of the Query. This spatial query has been studied during the recent years and several performance improving techniques and pruning heuristics have been proposed. In previous work, we presented the first MapReduce algorithm, consisting of alternating local and parallel phases, which can be used to effectively process the GKNN query when the Query fits in memory, while the Training one belongs to the Big Data category. In this paper, we present a significantly improved algorithm that incorporates a new high-performance refining method, a fast way to calculate distance sums for pruning purposes and several other minor coding and algorithmic improvements. Moreover, we transform this algorithm (which has been implemented in the Hadoop framework) to SpatialHadoop (a popular distributed framework that is dedicated to spatial processing), using a novel two-level partitioning method. Using real world and synthetic datasets, we also present a thorough experimental study of the Hadoop and SpatialHadoop versions of the algorithm, including a backstage analysis of the algorithm's performance, using metrics that highlight its internal functioning. Finally, we present an experimental comparison of the Hadoop, the SpatialHadoop versions and the version of our previous work, showing that the improved versions are the big winners, with the SpatialHadoop one being faster than its Hadoop counterpart.
BibTeX:
@article{Moutafis2020b,
  author = {Panagiotis Moutafis and Francisco García-García and George Mavrommatis and Michael Vassilakopoulos and Antonio Corral and Luis Iribarne},
  title = {Algorithms for processing the group K nearest-neighbor query on distributed frameworks},
  journal = {Distributed and Parallel Databases},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s10619-020-07317-8}
}
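As a reminder of what the query computes, here is the brute-force in-memory version: for each Training point, sum its distances to all Query points and keep the K smallest sums. The whole point of the paper is doing this at Big Data scale on Hadoop/SpatialHadoop, which this toy MATLAB sketch does not attempt.
% Brute-force group K nearest-neighbor (GKNN) query on in-memory data.
rng('default');
Q = rand(50, 2);                                 % Query dataset
T = rand(10000, 2);                              % Training dataset
K = 5;
sumDist = zeros(size(T, 1), 1);
for j = 1:size(Q, 1)                             % accumulate distances to every query point
    sumDist = sumDist + sqrt(sum((T - Q(j, :)).^2, 2));
end
[~, order] = sort(sumDist);
gknn = T(order(1:K), :);                         % the K training points with the smallest distance sums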
Muehlebach M and Jordan MI (2020), "Continuous-time Lower Bounds for Gradient-based Algorithms", February, 2020.
Abstract: This article derives lower bounds on the convergence rate of continuous-time gradient-based optimization algorithms. The algorithms are subjected to a time-normalization constraint that avoids a reparametrization of time in order to make the discussion of continuous-time convergence rates meaningful. We reduce the multi-dimensional problem to a single dimension, recover well-known lower bounds from the discrete-time setting, and provide insights into why these lower bounds occur. We further explicitly provide algorithms that achieve the proposed lower bounds, even when the function class under consideration includes certain non-convex functions.
BibTeX:
@article{Muehlebach2020,
  author = {Michael Muehlebach and Michael I. Jordan},
  title = {Continuous-time Lower Bounds for Gradient-based Algorithms},
  year = {2020}
}
Mukhopadhyay S (2020), "Stochastic Gradient Descent For Linear Systems With Sequential Matrix Entry Accumulation", Signal Processing., January, 2020. , pp. 107494. Elsevier BV.
Abstract: Conventional stochastic iterative methods are often employed for solving linear systems of equations involving large matrix sizes using low memory footprint. However, their performances are often limited by the unavailability of all the matrix entries, which is often termed as the problem of missing data. Although Ma and Needell [1] have recently proposed a method, termed as mSGD, assuming a model for data missing that results in improved convergence, their result is also affected by constant large variance of the stochastic gradient. In this paper we propose an SGD type method termed as cumulative information SGD (CISGD) for solving a linear system with missing data with an additional provision to accumulate a very small number of matrix entries sequentially per iteration, termed as the sequential matrix entry accumulation (SEMEA) mechanism. CISGD uses the data collected by the SEMEA mechanism along with the prior model for the data missing mechanism of [1] to gradually reduce the variance of the stochastic gradient. The convergence of the proposed CISGD is theoretically analyzed and some interesting implications of the result are investigated under a specific SEMEA mechanism. Finally, numerical experiments are performed along with simulations that corroborate the theoretical findings regarding the efficacy of the proposed CISGD method.
BibTeX:
@article{Mukhopadhyay2020,
  author = {Samrat Mukhopadhyay},
  title = {Stochastic Gradient Descent For Linear Systems With Sequential Matrix Entry Accumulation},
  journal = {Signal Processing},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {107494},
  doi = {10.1016/j.sigpro.2020.107494}
}
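The baseline the paper starts from is row-sampled stochastic gradient descent for Ax = b: sample one equation per iteration and step along its gradient. The sketch below shows only that baseline, with a Kaczmarz-style step length; the missing-data model and the SEMEA accumulation that define CISGD are not reproduced.
% Row-sampled SGD (Kaczmarz-style step) for a consistent linear system Ax = b.
rng('default');
m = 2000; n = 100;
A = randn(m, n);
xTrue = randn(n, 1);
b = A * xTrue;
x = zeros(n, 1);
for k = 1:50000
    i = randi(m);                                % sample one row uniformly at random
    ai = A(i, :)';
    res = ai' * x - b(i);                        % residual of the sampled equation
    x = x - (res / (ai' * ai)) * ai;             % project onto the sampled hyperplane
end
relErr = norm(x - xTrue) / norm(xTrue);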
Mukunoki D and Ogita T (2020), "Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs", Journal of Computational and Applied Mathematics., January, 2020. , pp. 112701. Elsevier BV.
Abstract: This paper presents the implementation, performance, and energy consumption of accurate and mixed-precision linear algebra kernels, including inner-product (DOT), dense matrix–vector multiplication (GEMV), dense matrix multiplication (GEMM), and sparse matrix–vector multiplication (SpMV) for the compressed sparse row (CSR) format (CSRMV), on graphics processing units (GPUs). We employ a mixed-precision design in our implementation, which makes it possible to perform internal floating-point operations with at least 2-fold the precision of the input and output data precision: for binary32 data, the computation is performed on binary64, and for binary64 data, the computation is performed on 2-fold the precision with an accurate inner product algorithm referred to as Dot2. We developed highly optimized implementations which can achieve performance close to the upper bound performance. From our evaluation on Titan V, a Volta architecture GPU, we made the following observations: as the Dot2 operation consumes 11 times as many binary64 instructions, GEMM requires the corresponding overheads (in terms of both execution time and energy consumption), compared to the standard binary64 implementation. On the other hand, the accuracy of DOT, GEMV, and CSRMV is improved with a very small overhead to the execution time and up to roughly 30% overhead to the energy requirement.
BibTeX:
@article{Mukunoki2020,
  author = {Daichi Mukunoki and Takeshi Ogita},
  title = {Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs},
  journal = {Journal of Computational and Applied Mathematics},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {112701},
  doi = {10.1016/j.cam.2019.112701}
}
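The binary32 side of this design is easy to mimic on a CPU: keep the data in single precision but accumulate internally in double. A small MATLAB illustration of the accuracy gap this closes (the GPU kernels and the Dot2 algorithm used for binary64 data are not reproduced here):
% Mixed-precision inner product: single (binary32) data, double (binary64) accumulation.
rng('default');
n = 1e6;
x = single(randn(n, 1));
y = single(randn(n, 1));
dotSingle = sum(x .* y);                         % plain binary32 kernel: single accumulation
dotMixed  = sum(double(x) .* double(y));         % mixed-precision kernel: double accumulation
fprintf('relative difference of all-single accumulation: %.3e\n', ...
        abs(double(dotSingle) - dotMixed) / abs(dotMixed));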
Murray M and Tanner J (2020), "The Permuted Striped Block Model and its Factorization -- Algorithms with Recovery Guarantees", April, 2020.
Abstract: We introduce a novel class of matrices which are defined by the factorization Y :=AX, where A is an m × n wide sparse binary matrix with a fixed number d nonzeros per column and X is an n × N sparse real matrix whose columns have at most k nonzeros and are dissociated. Matrices defined by this factorization can be expressed as a sum of n rank one sparse matrices, whose nonzero entries, under the appropriate permutations, form striped blocks - we therefore refer to them as Permuted Striped Block (PSB) matrices. We define the PSB data model as a particular distribution over this class of matrices, motivated by its implications for community detection, provable binary dictionary learning with real valued sparse coding, and blind combinatorial compressed sensing. For data matrices drawn from the PSB data model, we provide computationally efficient factorization algorithms which recover the generating factors with high probability from as few as N =O(nk2(n)) data vectors, where k, m and n scale proportionally. Notably, these algorithms achieve optimal sample complexity up to logarithmic factors.
BibTeX:
@article{Murray2020,
  author = {Michael Murray and Jared Tanner},
  title = {The Permuted Striped Block Model and its Factorization -- Algorithms with Recovery Guarantees},
  year = {2020}
}
Myers JM, Dunlavy DM, Teranishi K and Hollman DS (2020), "Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software: Extended Analysis", December, 2020.
Abstract: Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Since default parameters in SparTen are tuned to experimental results in prior published work on a single real-world dataset conducted using MATLAB implementations of these methods, it remains unclear if the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.
BibTeX:
@article{Myers2020,
  author = {Jeremy M. Myers and Daniel M. Dunlavy and Keita Teranishi and D. S. Hollman},
  title = {Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software: Extended Analysis},
  year = {2020}
}
Navarro CA, Carrasco R, Barrientos RJ, Riquelme JA and Vega R (2020), "GPU Tensor Cores for fast Arithmetic Reductions", January, 2020.
Abstract: This work proposes a GPU tensor core approach that encodes the arithmetic reduction of n numbers as a set of chained m × m matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is T(n) = 5 log_{m^2}(n) and its speedup is S = (4/5) log_2(m^2) over the classic O(n log n) parallel reduction algorithm. Experimental performance results show that the proposed reduction method is ∼3.2× faster than a conventional GPU reduction implementation, and preserves the numerical precision because the sub-results of each chain of R MMAs are kept as 32-bit floating point values, before being all reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of R=4,5 MMAs per block, while large thread-blocks work best with R=1. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
BibTeX:
@article{Navarro2020,
  author = {Cristóbal A. Navarro and Roberto Carrasco and Ricardo J. Barrientos and Javier A. Riquelme and Raimundo Vega},
  title = {GPU Tensor Cores for fast Arithmetic Reductions},
  year = {2020}
}
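The encoding itself is easy to see without any GPU: multiplying an m×m block of values by all-ones vectors collapses it to its sum, and chaining this shrinks n numbers by a factor m² per round. A plain MATLAB illustration of the idea (CPU only, no tensor cores; m = 16 would correspond to the MMA fragment size on the hardware):
% Arithmetic reduction expressed as chained block "MMA-like" products: each round
% collapses every m-by-m block of values to its sum via ones(1,m) * V * ones(m,1).
m = 4;                                           % block size (tensor-core fragments use m = 16)
v = rand(m^6, 1);                                % n = m^6 numbers to reduce
exact = sum(v);
while numel(v) > 1
    nBlocks = numel(v) / m^2;
    V = reshape(v, m, m, nBlocks);
    w = zeros(nBlocks, 1);
    for blk = 1:nBlocks
        w(blk) = ones(1, m) * V(:, :, blk) * ones(m, 1);
    end
    v = w;
end
fprintf('reduction error: %.3e\n', abs(v - exact));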
Nayak P, Cojean T and Anzt H (2020), "Evaluating Abstract Asynchronous Schwarz solvers", March, 2020.
Abstract: With the commencement of the exascale computing era, we realize that the majority of the leadership supercomputers are heterogeneous and massively parallel even on a single node with multiple co-processors such as GPUs and multiple cores on each node. For example, ORNL's Summit accumulates six NVIDIA Tesla V100s and 42 IBM Power9 cores on each node. Synchronizing across all these compute resources in a single node or even across multiple nodes is prohibitively expensive. Hence it is necessary to develop and study asynchronous algorithms that circumvent this issue of bulk-synchronous computing for massive parallelism. In this study, we examine the asynchronous version of the abstract Restricted Additive Schwarz method as a solver where we do not explicitly synchronize, but allow for communication of the data between the sub-domains to be completely asynchronous thereby removing the bulk synchronous nature of the algorithm. We accomplish this by using the one-sided RMA functions of the MPI standard. We study the benefits of using such an asynchronous solver over its synchronous counterpart on both multi-core architectures and on multiple GPUs. We also study the communication patterns and local solvers and their effect on the global solver. Finally, we show that this concept can render attractive runtime benefits over the synchronous counterparts.
BibTeX:
@article{Nayak2020,
  author = {Pratik Nayak and Terry Cojean and Hartwig Anzt},
  title = {Evaluating Abstract Asynchronous Schwarz solvers},
  year = {2020}
}
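For orientation, the synchronous starting point of this line of work is a one-level Schwarz-type iteration: each subdomain repeatedly solves its local system against the latest values held by the other subdomains. The sketch below is the simplest synchronous, non-overlapping (block-Jacobi) variant in MATLAB; the asynchronous, one-sided MPI exchange that the paper studies is not reproduced.
% Synchronous block-Jacobi sweeps (non-overlapping one-level Schwarz) for Ax = b.
A = gallery('poisson', 40);                      % SPD model problem (1600 unknowns)
n = size(A, 1);
b = ones(n, 1);
p = 4;                                           % number of subdomains
edges = round(linspace(0, n, p + 1));
x = zeros(n, 1);
for sweep = 1:500
    xOld = x;
    for s = 1:p                                  % in an MPI code each subdomain runs concurrently
        I = edges(s)+1 : edges(s+1);             % rows owned by subdomain s
        rhsLocal = b(I) - A(I, :)*xOld + A(I, I)*xOld(I);   % remove coupling to other subdomains
        x(I) = A(I, I) \ rhsLocal;               % local direct solve
    end
    if norm(b - A*x) < 1e-8 * norm(b), break; end
end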
Nayak P, Cojean T and Anzt H (2020), "Evaluating asynchronous Schwarz solvers on GPUs", The International Journal of High Performance Computing Applications., August, 2020. , pp. 109434202094681. SAGE Publications.
Abstract: With the commencement of the exascale computing era, we realize that the majority of the leadership supercomputers are heterogeneous and massively parallel. Even a single node can contain multiple co-processors such as GPUs and multiple CPU cores. For example, ORNL's Summit accumulates six NVIDIA Tesla V100 GPUs and 42 IBM Power9 cores on each node. Synchronizing across compute resources of multiple nodes can be prohibitively expensive. Hence, it is necessary to develop and study asynchronous algorithms that circumvent this issue of bulk-synchronous computing. In this study, we examine the asynchronous version of the abstract Restricted Additive Schwarz method as a solver. We do not explicitly synchronize, but allow the communication between the sub-domains to be completely asynchronous, thereby removing the bulk synchronous nature of the algorithm. We accomplish this by using the one-sided Remote Memory Access (RMA) functions of the MPI standard. We study the benefits of using such an asynchronous solver over its synchronous counterpart. We also study the communication patterns governed by the partitioning and the overlap between the sub-domains on the global solver. Finally, we show that this concept can render attractive performance benefits over the synchronous counterparts even for a well-balanced problem.
BibTeX:
@article{Nayak2020a,
  author = {Pratik Nayak and Terry Cojean and Hartwig Anzt},
  title = {Evaluating asynchronous Schwarz solvers on GPUs},
  journal = {The International Journal of High Performance Computing Applications},
  publisher = {SAGE Publications},
  year = {2020},
  pages = {109434202094681},
  doi = {10.1177/1094342020946814}
}
Nayak P, Cojean T and Anzt H (2020), "Two-stage Asynchronous Iterative Solvers for multi-GPU Clusters", In Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems.
Abstract: Given the trend of supercomputers accumulating much of their compute power in GPU accelerators composed of thousands of cores and operating in streaming mode, global synchronization points become a bottleneck, severely confining the performance of applications. In consequence, asynchronous methods breaking up the bulk-synchronous programming model are becoming increasingly attractive. In this paper, we study a GPU-focused asynchronous version of the Restricted Additive Schwarz (RAS) method that employs preconditioned Krylov subspace methods as subdomain solvers. We analyze the method for various parameters such as local solver tolerance and iteration counts. Leveraging the multi-GPU architecture on Summit, we show that these two-stage methods are more memory and time efficient than asynchronous RAS using direct solvers. We also demonstrate the superiority over synchronous counterparts, and present results using one-sided CUDA-aware MPI on up to 36 NVIDIA V100 GPUs.
BibTeX:
@inproceedings{Nayak2020b,
  author = {Pratik Nayak and Terry Cojean and Hartwig Anzt},
  title = {Two-stage Asynchronous Iterative Solvers for multi-GPU Clusters},
  booktitle = {Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems},
  year = {2020}
}
Nesi LL, Pinto VG, Miletto MC and Schnorr LM (2020), "Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure". Thesis at: Université Grenoble Alpes, CNRS, Inria.
Abstract: High-performance computing (HPC) applications enable the solution of compute-intensive problems in feasible time. Among many HPC paradigms, task-based programming has gathered community attention in recent years. This paradigm enables constructing an HPC application using a more declarative approach, structuring it as a directed acyclic graph (DAG). The performance evaluation of these applications is as hard as in any other programming paradigm. Understanding how to analyze these applications, employing the DAG and runtime metrics, presents opportunities to improve their performance. This article describes the StarVZ R-package available on CRAN for performance analysis of task-based applications. StarVZ transforms runtime trace data into different visualizations of the application behavior. An analyst can understand their applications' performance limitations and compare multiple executions. StarVZ has been successfully applied to several case studies, showing its applicability in a number of scenarios.
BibTeX:
@techreport{Nesi2020,
  author = {Lucas Leandro Nesi and Vinicius Garcia Pinto and Marcelo Cogo Miletto and Lucas Mello Schnorr},
  title = {Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure},
  school = {Université Grenoble Alpes, CNRS, Inria},
  year = {2020},
  url = {https://hal.inria.fr/hal-02960848/document}
}
Nesterov Y (2020), "Superfast second-order methods for Unconstrained Convex Optimization". Thesis at: UCL - SSH/LIDAM/CORE - Center for operations research and econometrics.
Abstract: In this paper, we present new second-order methods with convergence rate O(k^-4), where k is the iteration counter. This is faster than the existing lower bound for this type of schemes [1, 2], which is O(k^-7/2). Our progress can be explained by a finer specification of the problem class. The main idea of this approach consists in implementation of the third-order scheme from [15] using the second-order oracle. At each iteration of our method, we solve a nontrivial auxiliary problem by a linearly convergent scheme based on the relative non-degeneracy condition [3, 10]. During this process, the Hessian of the objective function is computed once, and the gradient is computed O(ln(1/𝜖)) times, where 𝜖 is the desired accuracy of the solution for our problem.
BibTeX:
@techreport{Nesterov2020,
  author = {Yurii Nesterov},
  title = {Superfast second-order methods for Unconstrained Convex Optimization},
  school = {UCL - SSH/LIDAM/CORE - Center for operations research and econometrics},
  year = {2020},
  url = {https://dial.uclouvain.be/pr/boreal/object/boreal%3A227146/datastream/PDF_01/view}
}
Nesterov Y (2020), "Inexact Accelerated High-Order Proximal-Point Methods". Thesis at: UCL - SSH/LIDAM/CORE - Center for operations research and econometrics.
Abstract: In this paper, we present a new framework of Bi-Level Unconstrained Minimization (BLUM) for development of accelerated methods in Convex Programming. These methods use approximations of the high-order proximal points, which are solutions of some auxiliary parametric optimization problems. For computing these points, we can use different methods, and, in particular, the lower-order schemes. This opens a possibility for the latter methods to overpass traditional limits of the Complexity Theory. As an example, we obtain a new second-order method with the convergence rate O(k^-4), where k is the iteration counter. This rate is better than the maximal possible rate of convergence for this type of methods, as applied to functions with Lipschitz continuous Hessian. We also present new methods with the exact auxiliary search procedure, which have the rate of convergence O(k^-(3p+1)/2), where p ≥ 1 is the order of the proximal operator. The auxiliary problem at each iteration of these schemes is convex.
BibTeX:
@techreport{Nesterov2020a,
  author = {Yurii Nesterov},
  title = {Inexact Accelerated High-Order Proximal-Point Methods},
  school = {UCL - SSH/LIDAM/CORE - Center for operations research and econometrics},
  year = {2020},
  url = {http://hdl.handle.net/2078.1/227219}
}
Nesterov Y (2020), "Inexact high-order proximal-point methods with auxiliary search procedure". Thesis at: UCL - SSH/LIDAM/CORE - Center for operations research and econometrics.
Abstract: In this paper, we complement the framework of Bi-Level Unconstrained Minimization (BLUM) [21] by a new pth-order proximal-point method convergent as O(k^-(3p+1)/2), where k is the iteration counter. As compared with [21], we replace the auxiliary line search by a convex segment search. This allows us to bound its complexity by a logarithm of the desired accuracy. Each step in this search needs an approximate computation of the proximal-point operator. Under an assumption on the boundedness of the (p+1)st derivative of the objective function, this can be done by one step of the pth-order augmented tensor method. In this way, for p = 2, we get a new second-order method with the rate of convergence O(k^-7/2) and logarithmic complexity of the auxiliary search at each iteration. Another possibility is to compute the proximal-point operator by lower-order minimization methods. As an example, for p = 3, we consider the upper-level process convergent as O(k^-5). Assuming the boundedness of the fourth derivative, an appropriate approximation of the proximal-point operator can be computed by a second-order method in a logarithmic number of iterations. This combination gives a second-order scheme with much better complexity than the existing theoretical limits.
BibTeX:
@techreport{Nesterov2020b,
  author = {Yurii Nesterov},
  title = {Inexact high-order proximal-point methods with auxiliary search procedure},
  school = {UCL - SSH/LIDAM/CORE - Center for operations research and econometrics},
  year = {2020},
  url = {http://hdl.handle.net/2078.1/227954}
}
Nesterov Y (2020), "Inexact basic tensor methods for some classes of convex optimization problems", Optimization Methods and Software., December, 2020. , pp. 1-29. Informa UK Limited.
Abstract: In this paper, we analyse the Basic Tensor Methods, which use approximate solutions of the auxiliary problems. The quality of this solution is described by the residual in the function value, which must be proportional to 𝜖^{(p+1)/p}, where p ≥ 1 is the order of the method and 𝜖 is the desired accuracy in the main optimization problem. We analyse in detail the auxiliary schemes for the third- and second-order tensor methods. The auxiliary problems for the third-order scheme can be solved very efficiently by a linearly convergent gradient-type method with a preconditioner. The most expensive operation in this process is a preliminary factorization of the Hessian of the objective function. For solving the auxiliary problem for the second-order scheme, we suggest two variants of the Fast Gradient Methods with restart, which converge as O(1/k^6), where k is the iteration counter. Finally, we present the results of the preliminary computational experiments.
BibTeX:
@article{Nesterov2020c,
  author = {Yurii Nesterov},
  title = {Inexact basic tensor methods for some classes of convex optimization problems},
  journal = {Optimization Methods and Software},
  publisher = {Informa UK Limited},
  year = {2020},
  pages = {1--29},
  doi = {10.1080/10556788.2020.1854252}
}
Nguyen DM (2020), "A Combination of CMAES-APOP Algorithm and Quasi-Newton Method", In Advanced Computational Methods for Knowledge Engineering. Cham , pp. 64-74. Springer International Publishing.
Abstract: In this paper, we present an approach for combining the CMAES-APOP algorithm with a local search in order to make a hybrid evolutionary algorithm. This combination is based on the information of population size in the evolution process of the CMAES-APOP algorithm, while the local search is a quasi-Newton line search algorithm. We will give some conditions to efficiently activate the local search inside CMAES-APOP. Some numerical experiments on multi-modal optimization problems will show the efficiency of the proposed approach.
BibTeX:
@inproceedings{Nguyen2020,
  author = {Nguyen, Duc Manh},
  editor = {Le Thi, Hoai An and Le, Hoai Minh and Pham Dinh, Tao and Nguyen, Ngoc Thanh},
  title = {A Combination of CMAES-APOP Algorithm and Quasi-Newton Method},
  booktitle = {Advanced Computational Methods for Knowledge Engineering},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {64--74}
}
Nie B, Jog A and Smirni E (2020), "Characterizing Accuracy-Aware Resilience of GPGPU Applications", In Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing. , pp. 111-120.
Abstract: Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing. In addition to achieving exascale performance at a stringent power budget, it is imperative for GPUs to provide reliable computing guarantees to the end user. In current commodity systems, such guarantees are often achieved by incurring high protection cost in terms of performance, power, and hardware resources. However, we argue that these strict guarantees are often not required (and that the associated protection overheads can be significantly reduced) because several GPGPU applications are either fault-tolerant or can accept a quantifiable loss in output quality. To this end, this paper characterizes in a hierarchical manner the accuracy-aware resilience of GPGPU applications consisting of thousands of threads. This characterization study shows that accuracy-aware error resilience exhibits several interesting patterns across threads at different hierarchies (i.e., kernel/thread-block/warp). The insights from this characterization study can be used to reduce the overheads of expensive protection or recovery mechanisms that are typically used by GPUs to ensure application reliability.
BibTeX:
@inproceedings{Nie2020,
  author = {Bin Nie and Adwait Jog and Evgenia Smirni},
  title = {Characterizing Accuracy-Aware Resilience of GPGPU Applications},
  booktitle = {Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing},
  year = {2020},
  pages = {111--120}
}
Nolet CJ, Lafargue V, Raff E, Nanditale T, Oates T, Zedlewski J and Patterson J (2020), "Bringing UMAP Closer to the Speed of Light with GPU Acceleration", August, 2020.
Abstract: The Uniform Manifold Approximation and Projection (UMAP) algorithm has become widely popular for its ease of use, quality of results, and support for exploratory, unsupervised, supervised, and semi-supervised learning. While many algorithms can be ported to a GPU in a simple and direct fashion, such efforts have resulted in inefficient and inaccurate versions of UMAP. We show a number of techniques that can be used to make a faster and more faithful GPU version of UMAP, and obtain speedups of up to 100x in practice. Many of these design choices/lessons are general purpose and may inform the conversion of other graph and manifold learning algorithms to use GPUs. Our implementation has been made publicly available as part of the open source RAPIDS cuML library (https://github.com/rapidsai/cuml).
BibTeX:
@article{Nolet2020,
  author = {Corey J. Nolet and Victor Lafargue and Edward Raff and Thejaswi Nanditale and Tim Oates and John Zedlewski and Joshua Patterson},
  title = {Bringing UMAP Closer to the Speed of Light with GPU Acceleration},
  year = {2020}
}
Ozkaya MY, Balin MF, Pinar A and Catalyurek Ü (2020), "A scalable graph generation algorithm to sample over a given shell distribution", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops., May, 2020. IEEE.
Abstract: Graphs are commonly used to model the relationships between various entities. These graphs can be enormously large and thus, scalable graph analysis has been the subject of many research efforts. To enable scalable analytics, many researchers have focused on generating realistic graphs that support controlled experiments for understanding how algorithms perform under changing graph features. Significant progress has been made on scalable graph generation which preserves some important graph properties (e.g., degree distribution, clustering coefficients). In this paper, we study how to sample a graph from the space of graphs with a given shell distribution. Shell distribution is related to the k-core, which is the largest subgraph where each vertex is connected to at least k other vertices. A k-shell is the subset of vertices that are in the k-core but not the (k+1)-core, and the shell distribution comprises the sizes of these shells. Core decompositions are widely used to extract information from graphs and to assist other computations. We present a scalable shared and distributed memory graph generator that, given a shell decomposition, generates a random graph that conforms to it. Our extensive experimental results show the efficiency and scalability of our methods. Our algorithm generates 2^33 vertices and 2^37 edges in less than 50 seconds on 384 cores.
BibTeX:
@inproceedings{Ozkaya2020,
  author = {M. Yusuf Ozkaya and M. Fatih Balin and Ali Pinar and ÜmitV. Catalyurek},
  title = {A scalable graph generation algorithm to sample over a given shell distribution},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ipdpsw50202.2020.00051}
}
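As a refresher on the quantity being matched: a vertex's shell is the degree threshold at which it is removed during the standard peeling (core decomposition) procedure. Below is a small, non-scalable MATLAB sketch of that peeling; generating graphs that conform to a prescribed shell distribution, which is what the paper does, is a much harder problem and is not attempted here.
% Core (shell) decomposition by peeling: shell(v) is the core number of vertex v.
rng('default');
n = 200;
A = sprand(n, n, 0.05);
A = spones(A + A'); A(1:n+1:end) = 0;            % random simple undirected graph
deg = full(sum(A, 2));
shell = zeros(n, 1);
alive = true(n, 1);
k = 0;
while any(alive)
    k = max(k, min(deg(alive)));                 % current peeling threshold
    peel = find(alive & deg <= k);
    while ~isempty(peel)
        v = peel(1);
        shell(v) = k;                            % v belongs to the k-shell
        alive(v) = false;
        nbrs = find(A(v, :) & alive');           % decrement degrees of surviving neighbours
        deg(nbrs) = deg(nbrs) - 1;
        peel = find(alive & deg <= k);
    end
end
shellSizes = accumarray(shell + 1, 1);           % shell distribution (entry j = size of shell j-1)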
Pachajoa C, Pacher C, Levonyak M and Gansterer WN (2020), "Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method", July, 2020.
Abstract: As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the necessary modifications to ESR to convert it into ESRP, and perform an experimental evaluation. We compare ESRP experimentally with previously-existing ESR and application-level in-memory CR. Our results confirm that the overhead for ESR is reduced significantly, both in the failure-free case, and if node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.
BibTeX:
@article{Pachajoa2020,
  author = {Carlos Pachajoa and Christina Pacher and Markus Levonyak and Wilfried N. Gansterer},
  title = {Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method},
  year = {2020}
}
Page BA and Kogge PM (2020), "Scalability of Sparse Matrix Dense Vector Multiply (SpMV) on a Migrating Thread Architecture", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops., May, 2020. IEEE.
Abstract: Sparse matrix dense vector multiplication (SpMV), exhibits the memory bandwidth and communication driven nature of many sparse linear algebra operations. Irregular memory accesses from the non-zero structure within a sparse matrix wreak havoc on performance. This paper presents strong scaling for communication avoiding SpMV implementations on a migrating thread system intended to address the lack of locality in sparse problems. We developed communication avoiding SpMV code to attempt to reduce off-node thread migration by using the hypergraph partitioning package HYPE to determine workload distribution. Additionally, we investigate the performance impact of overlapping communication and computation through the use of remote memory operations supported by the architecture. Incorporating remote memory operations with hypergraph partitioning we achieved 6.18X speedup for overall performance.
BibTeX:
@inproceedings{Page2020,
  author = {Brian A. Page and Peter M. Kogge},
  title = {Scalability of Sparse Matrix Dense Vector Multiply (SpMV) on a Migrating Thread Architecture},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/ipdpsw50202.2020.00088}
}
Palleschi A, Mengacci R, Angelini F, Caporale D, Pallottino L, Luca AD and Garabini M (2020), "Time-Optimal Trajectory Planning for Flexible Joint Robots", IEEE Robotics and Automation Letters. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: In this paper, a new approach is proposed to optimally plan the motion along a parametrized path for flexible joint robots, i.e., robots whose structure is purposefully provided with compliant elements. State-of-the-art methods efficiently solve the problem in case of torque-controlled rigid robots via a translation of the optimal control problem into a convex optimization problem. Recently, we showed that, for jerk-controlled rigid robots, the problem could be recast into a non-convex optimization problem. The non-convexity is given by bilinear constraints that can be efficiently handled through McCormick relaxations and spatial Branch-and-Bound techniques. In this paper, we show that, even in case of robots with flexible joints, the time-optimal trajectory planning problem can be recast into a non-convex problem in which the non-convexity is still given by bilinear constraints. We performed experimental tests on a planar 2R elastic manipulator to validate the benefits of the proposed approach. The scalability of the method for robots with multiple degrees of freedom is also discussed.
BibTeX:
@article{Palleschi2020,
  author = {Alessandro Palleschi and Riccardo Mengacci and Franco Angelini and Danilo Caporale and Lucia Pallottino and Alessandro De Luca and Manolo Garabini},
  title = {Time-Optimal Trajectory Planning for Flexible Joint Robots},
  journal = {IEEE Robotics and Automation Letters},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--1},
  doi = {10.1109/lra.2020.2965861}
}
Panagiotas I and Uçar B (2020), "Engineering fast almost optimal algorithms for bipartite graph matching", In Proceedings of the 2020 European Symposium on Algorithms.
Abstract: We consider the maximum cardinality matching problem in bipartite graphs. There are a number of exact, deterministic algorithms for this purpose, whose complexities are high in practice. There are randomized approaches for special classes of bipartite graphs. Random 2-out bipartite graphs, where each vertex chooses two neighbors at random from the other side, form one class for which there is an O(m + n log n)-time Monte Carlo algorithm. Regular bipartite graphs, where all vertices have the same degree, form another class for which there is an expected O(m + n log n)-time Las Vegas algorithm. We investigate these two algorithms and turn them into practical heuristics with randomization. Experimental results show that the heuristics are fast and obtain near optimal matchings. They are also more robust than the state of the art heuristics used in the cardinality matching algorithms, and are generally more useful as initialization routines.
BibTeX:
@inproceedings{Panagiotas2020,
  author = {Ioannis Panagiotas and Bora Uçar},
  title = {Engineering fast almost optimal algorithms for bipartite graph matching},
  booktitle = {Proceedings of the 2020 European Symposium on Algorithms},
  year = {2020},
  url = {https://hal.inria.fr/hal-02463717v3/document}
}
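To make the notion of a matching initialization heuristic concrete, here is a plain randomized greedy maximal matching on a bipartite graph, written as a MATLAB sketch. This is only a baseline of the kind such heuristics are compared against; it is not the 2-out or random-walk-based algorithms engineered in the paper, and the function and variable names are my own.
function [rowMatch, colMatch] = greedy_matching(B)
% B is an m-by-n (sparse) biadjacency matrix; returns a maximal matching.
[m, n] = size(B);
rowMatch = zeros(m, 1);
colMatch = zeros(n, 1);
for i = randperm(m)                      % visit row vertices in random order
    cols = find(B(i, :));                % neighbours of row vertex i
    free = cols(colMatch(cols) == 0);    % neighbours not matched yet
    if ~isempty(free)
        rowMatch(i) = free(1);
        colMatch(free(1)) = i;
    end
end
end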
Parger M, Winter M, Mlakar D and Steinberger M (2020), "spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis", In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., 2, 2020. ACM.
Abstract: Sparse general matrix-matrix multiplication on GPUs is challenging due to the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance for certain types of matrices, but fail to accelerate all kinds of matrices in the same manner. Our approach combines multiple strategies with dynamic parameter selection to dynamically choose and tune the best fitting algorithm for each row of the matrix. This choice is supported by a lightweight, multi-level matrix analysis, which carefully balances analysis cost and expected performance gains. Our evaluation on thousands of matrices with various characteristics shows that we outperform all currently available solutions in 79% over all matrices with >15k products and that we achieve the second best performance in 15%. For these matrices, our solution is on average 83% faster than the second best approach and up to 25× faster than other state-of-the-art GPU implementations. Using our approach, applications can expect great performance independent of the matrices they work on.
BibTeX:
@inproceedings{Parger2020,
  author = {Mathias Parger and Martin Winter and Daniel Mlakar and Markus Steinberger},
  title = {spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis},
  booktitle = {Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3332466.3374521}
}
Pearson C, Wu K, Chung I-H, Xiong J and Hwu W-M (2020), "TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes", December, 2020.
Abstract: MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to offer cripplingly poor GPU performance when manipulating derived datatypes on GPUs. This work presents a new MPI library, TEMPI, to address this issue. TEMPI first introduces a common datatype to represent equivalent MPI derived datatypes. TEMPI can be used as an interposed library on existing MPI deployments without system or application changes. Furthermore, this work presents a performance model of GPU derived datatype handling, demonstrating that previously preferred "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 1000x in a 3D halo exchange at 192 ranks.
BibTeX:
@article{Pearson2020,
  author = {Carl Pearson and Kun Wu and I-Hsin Chung and Jinjun Xiong and Wen-Mei Hwu},
  title = {TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes},
  year = {2020}
}
Perez AC, Acosta A, Almeida F and Blanco V (2020), "A dynamic Multi-Objective approach for dynamic load balancing in heterogeneous systems", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Modern standards in High Performance Computing (HPC) have started to consider energy consumption and power draw as a limiting factor. New and more complex architectures have been introduced in HPC systems to accommodate these new restrictions, and include coprocessors such as GPGPUs for intensive computational tasks. As systems increase in heterogeneity, workload distribution becomes an increasingly central problem for achieving maximum efficiency from every computational component. We present a Multi-Objective Dynamic Load Balancing (DLB) approach where several objectives can be applied to tune an application. These objectives can be dynamically exchanged during the execution of an algorithm to better adapt to the resources available in a system. We have implemented the Multi-Objective DLB together with a generic heuristic engine, designed to perform multiple strategies for DLB in iterative problems. We also present Ull Multiobjective Framework (UllMF), an open-source tool that implements the Multi-Objective generic approach. UllMF separates metric gathering, objective functions to be optimized and load balancing algorithms, and improves code portability using a simple interface to reduce the costs of new implementations. We illustrate how performance and energy consumption are improved for the implemented techniques, and analyze their quality using different DLB techniques from the literature.
BibTeX:
@article{Perez2020,
  author = {Alberto Cabrera Perez and Alejandro Acosta and Francisco Almeida and Vicente Blanco},
  title = {A dynamic Multi-Objective approach for dynamic load balancing in heterogeneous systems},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  pages = {1--1},
  doi = {10.1109/tpds.2020.2989869}
}
Pham M, Ninh A, Le H and Liu Y (2020), "An efficient algorithm for minimizing multi non-smooth component functions", Journal of Computational and Graphical Statistics., 8, 2020. , pp. 1-23. Informa UK Limited.
Abstract: Many problems in statistics and machine learning can be formulated as an optimization problem of a finite sum of non-smooth convex functions. We propose an algorithm to minimize this type of objective functions based on the idea of alternating linearization. Our algorithm retains the simplicity of contemporary methods without any restrictive assumptions on the smoothness of the loss function. We apply our proposed method to solve two challenging problems: overlapping group Lasso and convex regression with sharp partitions (CRISP). Numerical experiments show that our method is superior to the state-of-the-art algorithms, many of which are based on the accelerated proximal gradient method.
BibTeX:
@article{Pham2020,
  author = {Minh Pham and Anh Ninh and Hoang Le and Yufeng Liu},
  title = {An efficient algorithm for minimizing multi non-smooth component functions},
  journal = {Journal of Computational and Graphical Statistics},
  publisher = {Informa UK Limited},
  year = {2020},
  pages = {1--23},
  doi = {10.1080/10618600.2020.1804390}
}
Pi J, Wang H and Pardalos PM (2020), "A Dual Reformulation and Solution Framework for Regularized Convex Clustering Problems", European Journal of Operational Research., 9, 2020. Elsevier BV.
Abstract: Clustering techniques are powerful tools commonly used in statistical learning and data analytics. Most of the past research formulates clustering tasks as a non-convex problem, where a global optimum often cannot be found. Recent studies show that hierarchical clustering and k-means clustering can be relaxed and analyzed as a convex problem. Moreover, sparse convex clustering algorithms are proposed to extend the convex clustering framework to high-dimensional space by introducing an adaptive group-Lasso penalty term. Due to the non-smooth nature of the associated objective functions, there are still no efficient fast-convergent algorithms for clustering problems, even with convexity. In this paper, we first review the structure of convex clustering problems and prove the differentiability of their dual problems. We then show that such reformulated dual problems can be efficiently solved by the accelerated first-order methods with the feasibility projection. Furthermore, we present a general framework for convex clustering with regularization terms and discuss a specific implementation of this framework using L_1,1-norm. We also derive the dual form for the regularized convex clustering problems and show that it can be efficiently solved by embedding a projection operator and a proximal operator in the accelerated gradient method. Finally, we compare our approach with several other co-clustering algorithms using a number of example clustering problems. Numerical results show that our models and solution methods outperform all the compared algorithms for both convex clustering and convex co-clustering.
BibTeX:
@article{Pi2020,
  author = {J. Pi and Honggang Wang and Panos M. Pardalos},
  title = {A Dual Reformulation and Solution Framework for Regularized Convex Clustering Problems},
  journal = {European Journal of Operational Research},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.ejor.2020.09.010}
}
Pirkelbauer P, Lin P-H, Vanderbruggen T and Liao C (2020), "XPlacer: Automatic Analysis of Data Access Patterns on Heterogeneous CPU/GPU Systems", In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium.
Abstract: This paper presents XPlacer, a framework to automatically analyze problematic data access patterns in C++ and CUDA code. XPlacer records heap memory operations in both host and device code for later analysis. To this end, XPlacer instruments read and write operations, function calls, and kernel launches. Programmers mark points in the program execution where the recorded data is analyzed and anomalies diagnosed. XPlacer reports data access anti-patterns, including alternating CPU/GPU accesses to the same memory, memory with low access density, and unnecessary data transfers. The diagnostic also produces summative information about the recorded accesses, which aids users in identifying code that could degrade performance. The paper evaluates XPlacer using LULESH, a Lawrence Livermore proxy application, the Rodinia benchmarks, and an implementation of the Smith-Waterman algorithm. XPlacer diagnosed several performance issues in these codes. The elimination of a performance problem in LULESH resulted in a 3× speedup on a heterogeneous platform combining Intel CPUs and Nvidia GPUs.
BibTeX:
@inproceedings{Pirkelbauer2020,
  author = {Peter Pirkelbauer and Pei-Hung Lin and Tristan Vanderbruggen and Chunhua Liao},
  title = {XPlacer: Automatic Analysis of Data Access Patterns on Heterogeneous CPU/GPU Systems},
  booktitle = {Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium},
  year = {2020},
  url = {https://www.osti.gov/servlets/purl/1630806}
}
Ploskas N, Sahinidis NV and Samaras N (2020), "A triangulation and fill-reducing initialization procedure for the simplex algorithm", Mathematical Programming Computation., 6, 2020. Springer Science and Business Media LLC.
Abstract: The computation of an initial basis is of great importance for simplex algorithms since it determines to a large extent the number of iterations and the computational effort needed to solve linear programs. We propose three algorithms that aim to construct an initial basis that is sparse and will reduce the fill-in and computational effort during LU factorization and updates that are utilized in modern simplex implementations. The algorithms rely on triangulation and fill-reducing ordering techniques that are invoked prior to LU factorization. We compare the performance of the CPLEX 12.6.1 primal and dual simplex algorithms using the proposed starting bases against CPLEX using its default crash procedure over a set of 95 large benchmarks (NETLIB, Kennington, Mészáros, Mittelmann). The best proposed algorithm utilizes METIS (Karypis and Kumar in SIAM J Sci Comput 20:359–392, 1998), produces remarkably sparse starting bases, and results in 5% reduction of the geometric mean of the execution time of CPLEX's primal simplex algorithm. Although the proposed algorithm improves CPLEX's primal simplex algorithm across all problem types studied in this paper, it performs better on hard problems, i.e., the instances for which the CPLEX default requires over 1000 s. For these problems, the proposed algorithm results in 37% reduction of the geometric mean of the execution time of CPLEX's primal simplex algorithm. The proposed algorithm also reduces the execution time of CPLEX's dual simplex on hard instances by 10%. For the instances that are most difficult for CPLEX, and for which CPLEX experiences numerical difficulties as it approaches the optimal solution, the best proposed algorithm speeds up CPLEX by more than 10 times. Finally, the proposed algorithms lead to a natural way to parallelize CPLEX with speedups over CPLEX's dual simplex of 1.2 and 1.3 on two and four cores, respectively.
BibTeX:
@article{Ploskas2020,
  author = {Nikolaos Ploskas and Nikolaos V. Sahinidis and Nikolaos Samaras},
  title = {A triangulation and fill-reducing initialization procedure for the simplex algorithm},
  journal = {Mathematical Programming Computation},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s12532-020-00188-1}
}
Porcelli M and Toint PL (2020), "Global and local information in structured derivative free optimization with BFO", January, 2020.
Abstract: A structured version of derivative-free random pattern search optimization algorithms is introduced which is able to exploit coordinate partially separable structure (typically associated with sparsity) often present in unconstrained and bound-constrained optimization problems. This technique improves performance by orders of magnitude and makes it possible to solve large problems that otherwise are totally intractable by other derivative-free methods. A library of interpolation-based modelling tools is also described, which can be associated with the structured or unstructured versions of the initial BFO pattern search algorithm. The use of the library further enhances performance, especially when associated with structure. The significant gains in performance associated with these two techniques are illustrated using a new freely-available release of BFO which incorporates them. An interesting conclusion of the results presented is that providing global structural information on a problem can result in significantly fewer evaluations of the objective function than attempting to build local Taylor-like models.
BibTeX:
@article{Porcelli2020,
  author = {Margherita Porcelli and Philippe L. Toint},
  title = {Global and local information in structured derivative free optimization with BFO},
  year = {2020}
}
Priest BW, Dunton A and Sanders G (2020), "Scaling Graph Clustering with Distributed Sketches", July, 2020.
Abstract: The unsupervised learning of community structure, in particular the partitioning of vertices into clusters or communities, is a canonical and well-studied problem in exploratory graph analysis. However, as with most graph analyses, the introduction of immense scale presents challenges to traditional methods. Spectral clustering in distributed memory, for example, requires hundreds of expensive bulk-synchronous communication rounds to compute an embedding of vertices to a few eigenvectors of a graph-associated matrix. Furthermore, the whole computation may need to be repeated if the underlying graph changes by even a small percentage of edge updates. We present a method inspired by spectral clustering where we instead use matrix sketches derived from random dimension-reducing projections. We show that our method produces embeddings that yield performant clustering results given a fully-dynamic stochastic block model stream using both the fast Johnson-Lindenstrauss and CountSketch transforms. We also discuss the effects of stochastic block model parameters upon the required dimensionality of the subsequent embeddings, and show how random projections could significantly improve the performance of graph clustering in distributed memory.
BibTeX:
@article{Priest2020,
  author = {Benjamin W. Priest and Alec Dunton and Geoffrey Sanders},
  title = {Scaling Graph Clustering with Distributed Sketches},
  year = {2020}
}
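A toy version of the core idea, replacing an eigenvector embedding with a random dimension-reducing projection of the adjacency matrix before clustering, can be written in a few lines of MATLAB. The stochastic block model sizes, the sketch dimension and the use of a dense Gaussian projection are my own illustrative choices; the paper works in distributed memory with fast Johnson-Lindenstrauss and CountSketch transforms.
rng(0);
n = 200; k = 4;
blocks = repelem(1:k, n/k)';                   % planted community labels
P = 0.02 + 0.18*(blocks == blocks');           % SBM edge probabilities
A = double(triu(rand(n) < P, 1));
A = sparse(A + A');                            % symmetric adjacency matrix
d = 16;                                        % sketch dimension
S = randn(n, d) / sqrt(d);                     % Gaussian JL projection
embedding = A * S;                             % sketched vertex embedding
labels = kmeans(embedding, k);                 % cluster in the reduced space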
Quirynen R and Cairano SD (2020), "Block-Structured Preconditioning of Iterative Solvers within a Primal Active-Set Method for fast MPC". Thesis at: Mitsubishi Electric Research Laboratories (MERL).
Abstract: Model predictive control (MPC) for linear dynamical systems requires solving an optimal control structured quadratic program (QP) at each sampling instant. This paper proposes a primal active-set strategy, called PRESAS, for the efficient solution of such block-sparse QPs, based on a preconditioned iterative solver to compute the search direction in each iteration. Rank-one factorization updates of the preconditioner result in a per-iteration computational complexity of O(Nm^2), where m denotes the number of state and control variables and N the number of control intervals. Three different block-structured preconditioning techniques are presented and their numerical properties are studied further. In addition, an augmented Lagrangian based implementation is proposed to avoid a costly initialization procedure to find a primal feasible starting point. Based on a standalone C code implementation, we illustrate the computational performance of PRESAS against current state of the art QP solvers for multiple linear and nonlinear MPC case studies. We also show that the solver is real-time feasible on a dSPACE MicroAutoBox-II rapid prototyping unit for vehicle control applications, and numerical reliability is illustrated based on experimental results from a testbench of small-scale autonomous vehicles.
BibTeX:
@techreport{Quirynen2020,
  author = {Rien Quirynen and Stefano Di Cairano},
  title = {Block-Structured Preconditioning of Iterative Solvers within a Primal Active-Set Method for fast MPC},
  school = {Mitsubishi Electric Research Laboratories (MERL)},
  year = {2020},
  url = {https://www.merl.com/publications/docs/TR2020-134.pdf}
}
Rahimian H and Mehrotra S (2020), "Sequential Convexification of a Bilinear Set"
Abstract: We present a sequential convexification procedure to derive, in the limit, a set arbitrarily close to the convex hull of 𝜖-feasible solutions to a general nonconvex continuous bilinear set. Recognizing that bilinear terms can be represented with a finite number of nonlinear nonconvex constraints in the lifted matrix space, our procedure performs a sequential convexification with respect to all nonlinear nonconvex constraints. Moreover, our approach relies on generating lift-and-project cuts using simple 0-1 disjunctions, where cuts are generated at all fractional extreme point solutions of the current relaxation. An implication of our convexification procedure is that the constraints describing the convex hull can be used in a cutting plane algorithm to solve a linear optimization problem over the bilinear set to 𝜖-optimality.
BibTeX:
@article{Rahimian2020,
  author = {Hamed Rahimian and Sanjay Mehrotra},
  title = {Sequential Convexification of a Bilinear Set},
  year = {2020},
  url = {http://www.optimization-online.org/DB_FILE/2020/01/7595.pdf}
}
Raponi E, Wang H, Bujny M, Boria S and Doerr C (2020), "High Dimensional Bayesian Optimization Assisted by Principal Component Analysis", July, 2020.
Abstract: Bayesian Optimization (BO) is a surrogate-assisted global optimization technique that has been successfully applied in various fields, e.g., automated machine learning and design optimization. Built upon a so-called infill-criterion and Gaussian Process regression (GPR), the BO technique suffers from a substantial computational complexity and hampered convergence rate as the dimension of the search spaces increases. Scaling up BO for high-dimensional optimization problems remains a challenging task. In this paper, we propose to tackle the scalability of BO by hybridizing it with a Principal Component Analysis (PCA), resulting in a novel PCA-assisted BO (PCA-BO) algorithm. Specifically, the PCA procedure learns a linear transformation from all the evaluated points during the run and selects dimensions in the transformed space according to the variability of evaluated points. We then construct the GPR model, and the infill-criterion in the space spanned by the selected dimensions. We assess the performance of our PCA-BO in terms of the empirical convergence rate and CPU time on multi-modal problems from the COCO benchmark framework. The experimental results show that PCA-BO can effectively reduce the CPU time incurred on high-dimensional problems, and maintains the convergence rate on problems with an adequate global structure. PCA-BO therefore provides a satisfactory trade-off between the convergence rate and computational efficiency opening new ways to benefit from the strength of BO approaches in high dimensional numerical optimization.
BibTeX:
@article{Raponi2020,
  author = {Elena Raponi and Hao Wang and Mariusz Bujny and Simonetta Boria and Carola Doerr},
  title = {High Dimensional Bayesian Optimization Assisted by Principal Component Analysis},
  year = {2020}
}
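The dimensionality-reduction step at the heart of such a PCA-assisted approach can be sketched as follows. The data sizes and the 90% variance threshold are arbitrary illustrative choices, and the paper couples this reduction with a Gaussian Process surrogate and an infill criterion rather than stopping here.
X = rand(40, 20);                        % hypothetical evaluated points: 40 samples in 20-D
mu = mean(X, 1);
[~, S, V] = svd(X - mu, 'econ');         % principal directions of the evaluated points
varExplained = cumsum(diag(S).^2) / sum(diag(S).^2);
r = find(varExplained >= 0.9, 1);        % keep enough directions for 90% variance
Z = (X - mu) * V(:, 1:r);                % reduced coordinates used to fit the surrogate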
Regev S and Saunders MA (2020), "SSAI: A Symmetric Sparse Approximate Inverse Preconditioner for the Conjugate Gradient Methods PCG and PCGLS". Thesis at: Stanford University.
Abstract: We propose a method for solving a Hermitian positive definite linear system Ax = b, where A is an explicit sparse matrix (real or complex). A sparse approximate right inverse M is computed and replaced by its symmetrization M, which is used as a left-right preconditioner in a modified version of the preconditioned conjugate gradient method (PCG), where M is modified occasionally, if necessary, to make it more positive definite. M is formed column by column and can therefore be computed in parallel. PCG requires only matrix-vector multiplications with A and M (not solving a linear system with M), and so it too can be carried out in parallel. We compare it with incomplete Cholesky factorization (the gold standard for PCG) and with MATLAB's backslash operator (sparse Cholesky) on matrices from various applications. For least-squares problems, we implement an analogous form of preconditioned Conjugate Gradient Least-Squares (PCGLS) which is also shown to be robust.
BibTeX:
@techreport{Regev2020,
  author = {Shaked Regev and Michael A. Saunders},
  title = {SSAI: A Symmetric Sparse Approximate Inverse Preconditioner for the Conjugate Gradient Methods PCG and PCGLS},
  school = {Stanford University},
  year = {2020},
  url = {https://web.stanford.edu/group/SOL/reports/M130639.pdf}
}
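For context, here is a generic preconditioned CG iteration in MATLAB in which the preconditioner is an approximate inverse M, so each step needs only matrix-vector products with A and M rather than a triangular solve. This is the textbook PCG recursion, not the authors' modified variant with occasional updates of M.
function x = pcg_inverse_prec(A, M, b, tol, maxit)
% Solve A*x = b with CG, preconditioned by an approximate inverse M of A.
x = zeros(size(b));
r = b - A*x;
z = M*r;                           % preconditioning is a multiplication, not a solve
p = z;
rz = r'*z;
for it = 1:maxit
    Ap = A*p;
    alpha = rz / (p'*Ap);
    x = x + alpha*p;
    r = r - alpha*Ap;
    if norm(r) <= tol*norm(b)
        break
    end
    z = M*r;
    rzNew = r'*z;
    p = z + (rzNew/rz)*p;
    rz = rzNew;
end
end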
Ren Y and Gleich DF (2020), "A Simple Study of Pleasing Parallelism on Multicore Computers", In Parallel Algorithms in Computational Science and Engineering. , pp. 325-346. Springer International Publishing.
Abstract: Pleasingly parallel computations are those that involve completely independent work. We investigate these in the context of a problem we call AllPageRank. The AllPageRank problem involves computing a subset of accurate PageRank entries for each possible seeded PageRank vector. AllPageRank is representative of a wider class of possible computational procedures that will run a large number of experiments on a single graph structure. Our study involves computing the AllPageRank vectors for a multi-million node graph within a reasonable timeframe on a modern shared memory, high-core count computer. For this setting, we parallelize over all of the seeded PageRank vector computations, which are all independent. The experiments demonstrate that there are non-trivial complexities in obtaining performance even in this ideal situation. For instance, threaded computational environments exhibited scaling problems with a shared graph structure in memory. Also, sparse matrix ordering techniques and multivector, or SIMD, optimizations were required to get a total runtime of a few days. We also show how different PageRank algorithms, with different algorithmic advances and memory access patterns, behave, in order to guide future investigation of similar problems.
BibTeX:
@incollection{Ren2020,
  author = {Yanfei Ren and David F. Gleich},
  title = {A Simple Study of Pleasing Parallelism on Multicore Computers},
  booktitle = {Parallel Algorithms in Computational Science and Engineering},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {325--346},
  doi = {10.1007/978-3-030-43736-7_11}
}
Rezaei J, Zare-Mirakabad F, MirHassani SA and Marashi S-A (2020), "EIA-CNDP: An Exact Iterative Algorithm for Critical Node Detection Problem", Computers & Operations Research., November, 2020. , pp. 105138. Elsevier BV.
Abstract: In designing reliable and impermeable networks, the robustness of the network is evaluated against the removal or failure of nodes or edges, where network robustness (network connectivity) is measured using various metrics (objective functions) such as the number of connected components, size of the largest connected component, and pairwise connectivity. The critical node detection problem (CNDP) is one of the main issues in this literature, which aims to find a set of vertices whose removal maximizes or minimizes some objective function. In this paper, the focus is on solving CNDP, considering the size of the largest connected component as its objective function. In this regard, we introduce a new problem called the K-Group-Division-Problem and present a mixed integer linear programming model to solve it. We prove that under certain circumstances, any optimal solution of the new problem is also an optimal solution of CNDP. Analyzing the performance of the proposed model on solving CNDP indicates that this model is highly competitive against the base model in the literature. Furthermore, a novel exact algorithm is introduced which improves the proposed mixed integer linear programming model to address CNDP more efficiently. The results show that the proposed algorithm is much more efficient, and, compared with the base model, it can solve the problem on networks with a higher number of nodes.
BibTeX:
@article{Rezaei2020,
  author = {Javad Rezaei and Fatemeh Zare-Mirakabad and Seyed Ali MirHassani and Sayed-Amir Marashi},
  title = {EIA-CNDP: An Exact Iterative Algorithm for Critical Node Detection Problem},
  journal = {Computers & Operations Research},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {105138},
  doi = {10.1016/j.cor.2020.105138}
}
Rodomanov A and Nesterov Y (2020), "Greedy Quasi-Newton Methods with Explicit Superlinear Convergence", February, 2020.
Abstract: In this paper, we study greedy variants of quasi-Newton methods. They are based on the updating formulas from a certain subclass of the Broyden family. In particular, this subclass includes the well-known DFP, BFGS and SR1 updates. However, in contrast to the classical quasi-Newton methods, which use the difference of successive iterates for updating the Hessian approximations, our methods apply basis vectors, greedily selected so as to maximize a certain measure of progress. For greedy quasi-Newton methods, we establish an explicit non-asymptotic bound on their rate of local superlinear convergence, which contains a contraction factor, depending on the square of the iteration counter. We also show that these methods produce Hessian approximations whose deviation from the exact Hessians converges linearly to zero.
BibTeX:
@article{Rodomanov2020,
  author = {Anton Rodomanov and Yurii Nesterov},
  title = {Greedy Quasi-Newton Methods with Explicit Superlinear Convergence},
  year = {2020}
}
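As a reminder of the classical updates that this entry and the two following Rodomanov and Nesterov papers analyse, the textbook BFGS update of a Hessian approximation B from the step s = x_{k+1} - x_k and gradient difference y = ∇f(x_{k+1}) - ∇f(x_k) can be written as a one-line MATLAB helper; it is stated here only to fix notation, not as part of the papers' greedy variants.
function Bnew = bfgs_update(B, s, y)
% Classical BFGS update of a Hessian approximation (assumes y'*s > 0).
Bs = B*s;
Bnew = B - (Bs*Bs')/(s'*Bs) + (y*y')/(y'*s);
end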
Rodomanov A and Nesterov Y (2020), "Rates of Superlinear Convergence for Classical Quasi-Newton Methods", March, 2020.
Abstract: We study the local convergence of classical quasi-Newton methods for nonlinear optimization. Although it was well established a long time ago that asymptotically these methods converge superlinearly, the corresponding rates of convergence still remain unknown. In this paper, we address this problem. We obtain the first explicit non-asymptotic rates of superlinear convergence for the standard quasi-Newton methods, which are based on the updating formulas from the convex Broyden class. In particular, for the well-known DFP and BFGS methods, we obtain rates of the form (nL^2/(μ^2 k))^(k/2) and (nL/(μ k))^(k/2) respectively, where k is the iteration counter, n is the dimension of the problem, μ is the strong convexity parameter, and L is the Lipschitz constant of the gradient.
BibTeX:
@article{Rodomanov2020a,
  author = {Anton Rodomanov and Yurii Nesterov},
  title = {Rates of Superlinear Convergence for Classical Quasi-Newton Methods},
  year = {2020}
}
Rodomanov A and Nesterov Y (2020), "New Results on Superlinear Convergence of Classical Quasi-Newton Methods", CORE Discussion Papers ; 2020/13 (2020) 24 pages http://hdl.handle.net/2078.1/229640., April, 2020.
Abstract: We present a new theoretical analysis of local superlinear convergence of the classical quasi-Newton methods from the convex Broyden class. Our analysis is based on the potential function involving the logarithm of determinant of Hessian approximation and the trace of inverse Hessian approximation. For the well-known DFP and BFGS methods, we obtain the rates of the form [(L/μ)(exp{(n/k) ln(L/μ)} - 1)]^(k/2) and [exp{(n/k) ln(L/μ)} - 1]^(k/2) respectively, where k is the iteration counter, n is the dimension of the problem, μ is the strong convexity parameter, and L is the Lipschitz constant of the gradient. Currently, these are the best known superlinear convergence rates for these methods. In particular, our results show that the starting moment of superlinear convergence of the BFGS method depends on the logarithm of the condition number L/μ in the worst case.
BibTeX:
@article{Rodomanov2020b,
  author = {Anton Rodomanov and Yurii Nesterov},
  title = {New Results on Superlinear Convergence of Classical Quasi-Newton Methods},
  journal = {CORE Discussion Papers ; 2020/13 (2020) 24 pages http://hdl.handle.net/2078.1/229640},
  year = {2020}
}
Roy A, Balasubramanian K, Ghadimi S and Mohapatra P (2020), "Escaping Saddle-Points Faster under Interpolation-like Conditions", September, 2020.
Abstract: In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an 𝜖-local-minimizer matches the corresponding deterministic rate of O(1/𝜖^2). We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an 𝜖-local-minimizer under interpolation-like conditions is O(1/𝜖^2.5). While this obtained complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of O(1/𝜖^1.5) corresponding to the deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
BibTeX:
@article{Roy2020,
  author = {Abhishek Roy and Krishnakumar Balasubramanian and Saeed Ghadimi and Prasant Mohapatra},
  title = {Escaping Saddle-Points Faster under Interpolation-like Conditions},
  year = {2020}
}
Said NA, Benabdenbi M and Morin-Allory K (2020), "FPU Bit-Width Optimization for Approximate Computing: A Non-Intrusive Approach", In 2020 15th Design & Technology of Integrated Systems in Nanoscale Era (DTIS)., 4, 2020. IEEE.
Abstract: Floating-Point Units (FPUs) count as a significant part of computing resources in modern general-purpose and application-specific processors. Full-precision FPUs can be a source of extensive hardware overhead (power consumption, area, memory footprint, etc.). On the other hand, several applications feature the inherent ability to tolerate precision loss. This has led to the development of a new computing paradigm: Transprecision Computing (TC), where variable and arbitrary precision hardware FPUs have been introduced. Many tools and libraries have been proposed to simulate the effects of approximation on applications, to help designers select the most optimized FPU architecture adequate for a given application. However, existing techniques require developers to rewrite part or all of their existing software stacks (applications, libraries, operating systems …), which is often infeasible, complex or at least a very time-consuming development effort. This work proposes a non-intrusive approach, which does not need source code modification, by introducing approximations at the low level in assembly. This allows approximating virtually all kinds of executable binaries (bare-metal applications, single-/multi-threaded user applications, OS/RTOS, etc.). We implement the approach on top of the well-known QEMU dynamic binary translator. We perform experiments on a set of benchmarks from the literature, and we demonstrate how the approach further simplifies evaluating the impact of FP approximations on numerical application outputs, without being intrusive to the source code.
BibTeX:
@inproceedings{Said2020,
  author = {Noureddine Ait Said and Mounir Benabdenbi and Katell Morin-Allory},
  title = {FPU Bit-Width Optimization for Approximate Computing: A Non-Intrusive Approach},
  booktitle = {2020 15th Design & Technology of Integrated Systems in Nanoscale Era (DTIS)},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/dtis48698.2020.9080931}
}
van Santen VM, Diep FLF, Henkel J and Amrouch H (2020), "Massively Parallel Circuit Setup in GPU-SPICE", IEEE Transactions on Computers. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: SPICE simulations have been the industry standard for analyzing circuits for decades. However, they are computationally complex, as each circuit is simulated at the transistor level. This is in direct conflict with the ever-increasing demands of circuit designers, in which SPICE simulations for large circuits (e.g., DSPs, AES, etc.) at full accuracy are inevitably required to fulfill new industrial standards like automotive safety ISO 26262 with tool confidence level 1. To accelerate SPICE simulation without sacrificing accuracy, state-of-the-art approaches have started to employ GPUs to parallelize the LU-factorization and device linearization phases. Instead of focusing on these phases, this work demonstrates for the first time that when large circuits come into play, a new and equally important performance bottleneck emerges at the circuit setup phase. Speeding up the circuit setup phase in SPICE is our key focus in this paper. Our two implementations demonstrate that our GPU-based circuit setup reduces the analysis time from 4.5 days to merely 89 seconds for a 256-bit multiplier, which consists of more than 1M transistors. Our achieved speedup is 4396x compared to the baseline (open-source NGSPICE) and more than 2x compared to commercial (HSPICE and Spectre) SPICE circuit setup.
BibTeX:
@article{Santen2020,
  author = {Victor M. van Santen and Fu Lam Florian Diep and Jorg Henkel and Hussam Amrouch},
  title = {Massively Parallel Circuit Setup in GPU-SPICE},
  journal = {IEEE Transactions on Computers},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  doi = {10.1109/tc.2020.3032343}
}
dos Santos FF, Brandalero M, Basso PM, Hubner M, Carro L and Rech P (2020), "Reduced-Precision DWC for Mixed-Precision GPUs", In Proceedings of the 26th IEEE International Symposium on On-Line Testing and Robust System Design., 7, 2020. IEEE.
Abstract: Duplication with Comparison (DWC) is an effective software-level solution to improve the reliability of computing systems, including Graphics Processing Units (GPUs). DWC, however, introduces performance and energy consumption overheads that could be unacceptable for High-Performance Computing (HPC) or real-time safety-critical applications. In this work, we propose Reduced-Precision DWC (RP-DWC): an improvement over the traditional DWC approach that uses mixed-precision GPU hardware resources to implement fault detection. We investigate, through both fault injection campaigns and accelerated neutron beam experiments, the impact of RP-DWC on performance, energy consumption, and its fault detection capabilities. We show that RP-DWC achieves on average 74% fault coverage (up to 86%) with very small overheads (0.1% time and 24% energy consumption overhead, in the best case).
BibTeX:
@inproceedings{Santos2020,
  author = {Fernando Fernandes dos Santos and Marcelo Brandalero and Pedro Martins Basso and Michael Hubner and Luigi Carro and Paolo Rech},
  title = {Reduced-Precision DWC for Mixed-Precision GPUs},
  booktitle = {Proceedings of the 26th IEEE International Symposium on On-Line Testing and Robust System Design},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/iolts50870.2020.9159748}
}
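The idea of a reduced-precision duplicate can be illustrated in a few lines of MATLAB: redo the work in single precision and flag a fault only when the two results disagree by more than the expected rounding gap. The matrix size and the detection threshold below are arbitrary; the paper's implementation targets mixed-precision GPU hardware and is evaluated with fault injection and beam experiments.
rng(0);
A = randn(256); x = randn(256, 1);
y  = A * x;                                   % primary computation in double precision
yr = single(A) * single(x);                   % reduced-precision duplicate
relGap = norm(double(yr) - y) / norm(y);
faultDetected = relGap > 1e-3;                % threshold must sit above normal rounding error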
Sao P, Kannan R, Gera P and Vuduc R (2020), "A supernodal all-pairs shortest path algorithm", In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., 2, 2020. ACM.
Abstract: We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (Apsp) problem. Floyd-Warshall is an attractive choice for Apsp on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited, Floyd-Warshall will perform unnecessary asymptotic work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality and enhance parallelism. We implement these ideas in an efficient shared memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra's algorithm (the algorithmic core of Johnson's algorithm) for several classes of sparse graphs.
BibTeX:
@inproceedings{Sao2020,
  author = {Piyush Sao and Ramakrishnan Kannan and Prasun Gera and Richard Vuduc},
  title = {A supernodal all-pairs shortest path algorithm},
  booktitle = {Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3332466.3374533}
}
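As a reminder of the dense baseline being sparsified, the plain Floyd-Warshall recurrence is shown below in MATLAB; the paper's contribution lies in avoiding this algorithm's unnecessary work on sparse graphs via fill-reducing ordering, symbolic analysis, supernodes and elimination-tree parallelism.
function D = floyd_warshall(W)
% W is an n-by-n weight matrix with Inf for absent edges and 0 on the diagonal.
D = W;
n = size(W, 1);
for k = 1:n
    D = min(D, D(:, k) + D(k, :));   % relax every vertex pair through vertex k
end
end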
Sayegh ATA and Sotelino ED (2020), "A new row-wise parallel finite element analysis algorithm with dynamic load balancing", International Journal of Earthquake and Impact Engineering. Vol. 3(2), pp. 120. Inderscience Publishers.
Abstract: A parallel scheme is devised to efficiently parallelise all steps of parallel finite element analysis in this study. In addition, this scheme is based on a row-wise matrix distribution. A new row-wise parallel finite element analysis algorithm that exploits the nature of distributed compressed row sparse matrices and multivectors to improve concurrency is developed. A new dynamic load balancing technique has also been devised. The dynamic load balancing technique has been designed specifically to balance the computational workload among processors suitable for the analysis of nonlinear structures. This new algorithm has been implemented in ParaStruc, which is a parallel structural analysis system. Trilinos, a set of parallel numerical libraries developed by researchers in the Sandia National Laboratory has been used to build this algorithm. ParaStruc is a lightweight fully parallelised parallel finite element analysis system, which contains only three classes and a pre-processor. It is shown that this approach produces superior performance in terms of speedup, efficiency, and isoefficiency in the analysis of nonlinear structure response ranges when compared to parallel ABAQUS. The performance and efficiency of this algorithm has been verified with numerical simulations of a 200-metre 50-story 10-frame 10-bay 3D structure subjected to various load levels.
BibTeX:
@article{Sayegh2020,
  author = {Ammar T. Al Sayegh and Elisa D. Sotelino},
  title = {A new row-wise parallel finite element analysis algorithm with dynamic load balancing},
  journal = {International Journal of Earthquake and Impact Engineering},
  publisher = {Inderscience Publishers},
  year = {2020},
  volume = {3},
  number = {2},
  pages = {120},
  doi = {10.1504/ijeie.2020.108588}
}
Schäfer F, Katzfuss M and Owhadi H (2020), "Sparse Cholesky factorization by Kullback-Leibler minimization", April, 2020.
Abstract: We propose to compute a sparse approximate inverse Cholesky factor L of a dense covariance matrix Θ by minimizing the Kullback-Leibler divergence between the Gaussian distributions N(0, Θ) and N(0, L^{-⊤}L^{-1}), subject to a sparsity constraint. Surprisingly, this problem has a closed-form solution that can be computed efficiently, recovering the popular Vecchia approximation in spatial statistics. Based on recent results on the approximate sparsity of inverse Cholesky factors of Θ obtained from pairwise evaluation of Green's functions of elliptic boundary-value problems at points {x_i}_{1 ≤ i ≤ N} ⊂ ℝ^d, we propose an elimination ordering and sparsity pattern that allows us to compute 𝜖-approximate inverse Cholesky factors of such Θ in computational complexity O(N log(N/𝜖)^d) in space and O(N log(N/𝜖)^{2d}) in time. To the best of our knowledge, this is the best asymptotic complexity for this class of problems. Furthermore, our method is embarrassingly parallel, automatically exploits low-dimensional structure in the data, and can perform Gaussian-process regression in linear (in N) space complexity. Motivated by the optimality properties of our methods, we propose methods for applying it to the joint covariance of training and prediction points in Gaussian-process regression, greatly improving stability and computational cost. Finally, we show how to apply our method to the important setting of Gaussian processes with additive noise, sacrificing neither accuracy nor computational complexity.
BibTeX:
@article{Schaefer2020,
  author = {Florian Schäfer and Matthias Katzfuss and Houman Owhadi},
  title = {Sparse Cholesky factorization by Kullback-Leibler minimization},
  year = {2020}
}
Schenker C, Cohen JE and Acar E (2020), "A Flexible Optimization Framework for Regularized Matrix-Tensor Factorizations with Linear Couplings", July, 2020.
Abstract: Coupled matrix and tensor factorizations (CMTF) are frequently used to jointly analyze data from multiple sources, also called data fusion. However, different characteristics of datasets stemming from multiple sources pose many challenges in data fusion and require to employ various regularizations, constraints, loss functions and different types of coupling structures between datasets. In this paper, we propose a flexible algorithmic framework for coupled matrix and tensor factorizations which utilizes Alternating Optimization (AO) and the Alternating Direction Method of Multipliers (ADMM). The framework facilitates the use of a variety of constraints, loss functions and couplings with linear transformations in a seamless way. Numerical experiments on simulated and real datasets demonstrate that the proposed approach is accurate, and computationally efficient with comparable or better performance than available CMTF methods for Frobenius norm loss, while being more flexible. Using Kullback-Leibler divergence on count data, we demonstrate that the algorithm yields accurate results also for other loss functions.
BibTeX:
@article{Schenker2020,
  author = {Carla Schenker and Jeremy E. Cohen and Evrim Acar},
  title = {A Flexible Optimization Framework for Regularized Matrix-Tensor Factorizations with Linear Couplings},
  year = {2020}
}
Schmidt D (2020), "A Survey of Singular Value Decomposition Methods for Distributed Tall/Skinny Data", September, 2020.
Abstract: The Singular Value Decomposition (SVD) is one of the most important matrix factorizations, enjoying a wide variety of applications across numerous application domains. In statistics and data analysis, common applications of the SVD include Principal Components Analysis (PCA) and linear regression. Usually these applications arise on data that has far more rows than columns, so-called "tall/skinny" matrices. In the big data analytics context, this may take the form of hundreds of millions to billions of rows with only a few hundred columns. There is a need, therefore, for fast, accurate, and scalable tall/skinny SVD implementations which can fully utilize modern computing resources. To that end, we present a survey of three different algorithms for computing the SVD for these kinds of tall/skinny data layouts using MPI for communication. We contextualize these with common big data analytics techniques, principally PCA. Finally, we present both CPU and GPU timing results from the Summit supercomputer, and discuss possible alternative approaches.
BibTeX:
@article{Schmidt2020,
  author = {Drew Schmidt},
  title = {A Survey of Singular Value Decomposition Methods for Distributed Tall/Skinny Data},
  year = {2020}
}
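One standard tall/skinny strategy of the kind the survey discusses can be sketched on a single node: factor the tall matrix with a thin QR, take the SVD of the small triangular factor, and recover the left singular vectors. The matrix dimensions here are arbitrary, and distributed versions replace the QR with a communication-avoiding (TSQR-style) variant.
rng(0);
A = randn(100000, 50);            % far more rows than columns
[Q, R] = qr(A, 0);                % thin QR: Q is m-by-n, R is n-by-n
[UR, S, V] = svd(R);              % SVD of the small n-by-n factor
U = Q * UR;                       % left singular vectors of A
relErr = norm(A - U*S*V', 'fro') / norm(A, 'fro');   % should be near machine precision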
Scott J and Tůma M (2020), "A Null-Space Approach for Symmetric Saddle Point Systems With a Non Zero (2,2) Block", SIAM Journal on Scientific Computing.
Abstract: Null-space methods have long been used to solve large-scale symmetric saddle point systems of equations in which the k × k (2, 2) block is zero. This paper focuses on the case where the (2, 2) block is non zero. A novel null-space approach is proposed to transform the saddle point system into another symmetric saddle point system of the same order but with a zero (2, 2) block of order at most 2k. Success of any null-space approach is dependent on the construction of a suitable null-space basis. The not uncommon case of the off-diagonal block being a wide matrix that has far fewer rows than columns and that may be dense is considered. A number of approaches are explored with the aim of balancing stability of the transformed system with sparsity. Linear least squares problems that contain a small number of dense rows arising from practical applications are used to illustrate our ideas and to explore their potential for solving large-scale systems.
BibTeX:
@article{Scott2020,
  author = {Jennifer Scott and Miroslav Tůma},
  title = {A Null-Space Approach for Symmetric Saddle Point Systems With a Non Zero (2,2) Block},
  journal = {SIAM Journal on Scientific Computing},
  year = {2020},
  url = {https://www2.karlin.mff.cuni.cz/~mirektuma/ps/RAL-P-2020-003.pdf}
}
Scott J and Tůma M (2020), "A computational study of using black-box QR solvers for large-scale sparse-dense linear least squares problems", ACM Transactions on Mathematical Software.
Abstract: Large-scale overdetermined linear least squares problems arise in many practical applications, both as subproblems of nonlinear least squares problems and in their own right. One popular solution method is based on the backward stable QR factorization of the system matrix A. This paper focuses on sparse-dense linear least squares problems, that is, problems where A is sparse except from a small number of rows that are considered to be dense. For large-scale problems, the direct application of a QR solver will fail because of a lack of memory or will be unacceptably slow. We study a number of approaches for solving such problems using a sparse QR solver without modification. We consider the case where the sparse part of A is rank-deficient and show that either preprocessing A using partial matrix stretching or using regularization and employing a direct-iterative approach can be seamlessly combined with a black-box QR solver. Furthermore, we propose extending the augmented system formulation with iterative refinement for sparse problems to sparse-dense problems and demonstrate experimentally that multi-precision variants can be successfully used.
BibTeX:
@article{Scott2020a,
  author = {J. Scott and M Tůma},
  title = {A computational study of using black-box QR solvers for large-scale sparse-dense linear least squares problems},
  journal = {ACM Transactions on Mathematical Software},
  year = {2020},
  url = {http://purl.org/net/epubs/manifestation/47616417/RAL-P-2020-004.pdf}
}
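The augmented-system formulation mentioned in the abstract is the classical symmetric reformulation of the least squares problem min ||A*x - b||_2: solve [I A; A' 0]*[r; x] = [b; 0], where r = b - A*x is the residual. A minimal MATLAB sketch with an arbitrary random test matrix (assumed to have full column rank) follows; the paper's setting additionally separates sparse and dense rows of A.
m = 200; n = 30;
A = sprandn(m, n, 0.1);                     % sparse test matrix (assumed full column rank)
b = randn(m, 1);
K = [speye(m), A; A', sparse(n, n)];        % augmented system [I A; A' 0]
sol = K \ [b; zeros(n, 1)];
x = sol(m+1:end);                           % least squares solution; sol(1:m) is the residual
% Sanity check against the normal equations: norm(x - (A'*A) \ (A'*b)) should be small.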
Selvitopi O, Hussain MT, Azad A and Buluç A (2020), "Optimizing High Performance Markov Clustering for Pre-Exascale Architectures", 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020., February, 2020.
Abstract: HipMCL is a high-performance distributed memory implementation of the popular Markov Cluster Algorithm (MCL) and can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication kernel (SpGEMM). The existing parallel algorithms in HipMCL are not scalable to Exascale architectures, both because their communication costs dominate the runtime at large concurrencies and because they cannot take advantage of accelerators that are increasingly popular. In this work, we systematically remove scalability and performance bottlenecks of HipMCL. We enable GPUs by performing the expensive expansion phase of the MCL algorithm on GPU. We propose a CPU-GPU joint distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a probabilistic memory requirement estimator that is fast and accurate. We develop a new merging algorithm for the incremental processing of partial results produced by the GPUs, which improves the overlap efficiency and the peak memory usage. We also integrate a recent and faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With the enabling of the GPUs and integration of new algorithms, HipMCL is up to 12.4x faster, being able to cluster a network with 70 million proteins and 68 billion connections just under 15 minutes using 1024 nodes of ORNL's Summit supercomputer.
BibTeX:
@article{Selvitopi2020,
  author = {Oguz Selvitopi and Md Taufique Hussain and Ariful Azad and Aydın Buluç},
  title = {Optimizing High Performance Markov Clustering for Pre-Exascale Architectures},
  journal = {34th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020},
  year = {2020}
}
Selvitopi O, Ekanayake S, Guidi G, Pavlopoulos G, Azad A and Buluc A (2020), "Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices", September, 2020.
Abstract: Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences.
BibTeX:
@article{Selvitopi2020a,
  author = {Oguz Selvitopi and Saliya Ekanayake and Giulia Guidi and Georgios Pavlopoulos and Ariful Azad and Aydin Buluc},
  title = {Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices},
  year = {2020}
}
di Serafino D, Toraldo G and Viola M (2020), "Using gradient directions to get global convergence of Newton-type methods", April, 2020.
Abstract: The renewed interest in Steepest Descent (SD) methods following the work of Barzilai and Borwein [IMA Journal of Numerical Analysis, 8 (1988)] has driven us to consider a globalization strategy based on SD, which is applicable to any line-search method. In particular, we combine Newton-type directions with scaled SD steps to have suitable descent directions. Scaling the SD directions with a suitable step length makes a significant difference with respect to similar globalization approaches, in terms of both theoretical features and computational behavior. We apply our strategy to Newton's method and the BFGS method, with computational results that appear interesting compared with the results of well-established globalization strategies devised ad hoc for those methods.
BibTeX:
@article{Serafino2020,
  author = {Daniela di Serafino and Gerardo Toraldo and Marco Viola},
  title = {Using gradient directions to get global convergence of Newton-type methods},
  year = {2020}
}
Sewall J, Pennycook SJ, Jacobsen D, Deakin T and McIntosh-Smith S (2020), "Interpreting and Visualizing Performance Portability Metrics", Proceedings of the 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC.
Abstract: Recent work has introduced a number of tools and techniques for reasoning about the interplay between application performance and portability, or “performance portability”. These tools have proven useful for setting goals and guiding high-level discussions, but our understanding of the performance portability problem remains incomplete. Different views of the same performance efficiency data offer different insights into an application's performance portability (or lack thereof): standard statistical measures such as the mean and standard deviation require careful interpretation, and even metrics designed specifically to measure performance portability may obscure differences between applications. This paper offers a critical assessment of existing approaches for summarizing performance efficiency data across different platforms, and proposes visualization as a means to extract useful information about the underlying distribution. We explore a number of alternative visualizations, outlining a new methodology that enables developers to reason about the performance portability of their applications and how it might be improved. This study unpicks what it might mean to be “performance portable” and provides useful tools to explore that question.
BibTeX:
@article{Sewall2020,
  author = {Jason Sewall and S. John Pennycook and Douglas Jacobsen and Tom Deakin and Simon McIntosh-Smith},
  title = {Interpreting and Visualizing Performance Portability Metrics},
  journal = {Proceedings of the 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC},
  year = {2020}
}
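For orientation, one of the summary statistics this paper scrutinizes is the harmonic-mean performance portability metric of Pennycook, Sewall and Lee. A minimal MATLAB sketch follows; the efficiency values are made up, and this only shows the baseline metric, not the paper's visualizations.
eff = [0.9 0.6 0.75 0.8];              % architectural efficiency of one app on each platform (toy numbers)
if all(eff > 0)
    PP = numel(eff) / sum(1 ./ eff)    % harmonic mean over the platform set
else
    PP = 0                             % app unsupported on some platform => metric is zero
end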
Shi Z (2020), "SingCubic: Cyclic Incremental Newton-type Gradient Descent with Cubic Regularization for Non-Convex Optimization", February, 2020.
Abstract: In this work, we generalized and unified two recent completely different works of [] and [] respectively into one by proposing the cyclic incremental Newton-type gradient descent with cubic regularization (SingCubic) method for optimizing non-convex functions. Through the iterations of SingCubic, a cubic regularized global quadratic approximation using Hessian information is kept and solved. Preliminary numerical experiments show the encouraging performance of the SingCubic algorithm when compared to basic incremental or stochastic Newton-type implementations. The results and techniques can serve as a starting point for research on incremental Newton-type gradient descent methods that employ cubic regularization. The methods and principles proposed in this paper can be used for logistic regression, autoencoder training, independent components analysis, Ising model/Hopfield network training, multilayer perceptrons, deep convolutional network training and so on. We will open-source parts of our implementations soon.
BibTeX:
@article{Shi2020,
  author = {Ziqiang Shi},
  title = {SingCubic: Cyclic Incremental Newton-type Gradient Descent with Cubic Regularization for Non-Convex Optimization},
  year = {2020}
}
Simonetto A, Dall'Anese E, Paternain S, Leus G and Giannakis GB (2020), "Time-Varying Convex Optimization: Time-Structured Algorithms and Applications", June, 2020.
Abstract: Optimization underpins many of the challenges that science and technology face on a daily basis. Recent years have witnessed a major shift from traditional optimization paradigms grounded on batch algorithms for medium-scale problems to challenging dynamic, time-varying, and even huge-size settings. This is driven by technological transformations that converted infrastructural and social platforms into complex and dynamic networked systems with pervasive sensing and computing capabilities. The present paper reviews a broad class of state-of-the-art algorithms for time-varying optimization, with an eye to both algorithmic development and performance analysis. It offers a comprehensive overview of available tools and methods, and unveils open challenges in application domains of broad interest. The real-world examples presented include smart power systems, robotics, machine learning, and data analytics, highlighting domain-specific issues and solutions. The ultimate goal is to exemplify the wide engineering relevance of the analytical tools and pertinent theoretical foundations.
BibTeX:
@article{Simonetto2020,
  author = {Andrea Simonetto and Emiliano Dall'Anese and Santiago Paternain and Geert Leus and Georgios B. Giannakis},
  title = {Time-Varying Convex Optimization: Time-Structured Algorithms and Applications},
  year = {2020}
}
Slota GM, Root C, Devine K, Madduri K and Rajamanickam S (2020), "Scalable, Multi-Constraint, Complex-Objective Graph Partitioning", IEEE Transactions on Parallel and Distributed Systems., 12, 2020. Vol. 31(12), pp. 2789-2801. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: We introduce XtraPuLP, a distributed-memory graph partitioner designed to process irregular trillion-edge graphs. XtraPuLP is based on the scalable label propagation community detection technique, which has been demonstrated in various prior works as a viable means to produce high quality partitions of skewed and small-world graphs with minimal computation time. Our XtraPuLP implementation can also be generalized to compute partitions with an arbitrary number of constraints, and it can compute partitions with balanced communication load across all parts. On a collection of large sparse graphs, we show that XtraPuLP partitioning is considerably faster than state-of-the-art partitioning methods, while also demonstrating that XtraPuLP can produce partitions of real-world graphs with billion+ vertices and over a hundred billion edges in minutes. Additionally, we demonstrate XtraPuLP on a variety of applications, including large-scale graph analytics and sparse matrix-vector multiplication.
BibTeX:
@article{Slota2020,
  author = {George M. Slota and Cameron Root and Karen Devine and Kamesh Madduri and Sivasankaran Rajamanickam},
  title = {Scalable, Multi-Constraint, Complex-Objective Graph Partitioning},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  volume = {31},
  number = {12},
  pages = {2789--2801},
  doi = {10.1109/tpds.2020.3002150}
}
Sofranac B, Gleixner A and Pokutta S (2020), "Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices", September, 2020.
Abstract: Fast domain propagation of linear constraints has become a crucial component of today's best algorithms and solvers for mixed integer programming and pseudo-boolean optimization to achieve peak solving performance. Irregularities in the form of dynamic algorithmic behaviour, dependency structures, and sparsity patterns in the input data make efficient implementations of domain propagation on GPUs and, more generally, on parallel architectures challenging. This is one of the main reasons why domain propagation in state-of-the-art solvers is single-threaded. In this paper, we present a new algorithm for domain propagation which (a) avoids these problems and allows for an efficient implementation on GPUs, and is (b) capable of running propagation rounds entirely on the GPU, without any need for synchronization or communication with the CPU. We present extensive computational results which demonstrate the effectiveness of our approach and show that ample speedups are possible on practically relevant problems: on state-of-the-art GPUs, our geometric mean speed-up for reasonably large instances is around 10x to 20x and can be as high as 195x on favorably large instances.
BibTeX:
@article{Sofranac2020,
  author = {Boro Sofranac and Ambros Gleixner and Sebastian Pokutta},
  title = {Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices},
  year = {2020}
}
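To make the kernel concrete: domain propagation tightens variable bounds using the activity of each linear constraint. The following sequential, dense MATLAB toy shows the per-row bound-tightening arithmetic for a single constraint a'*x <= b; it only illustrates the kind of operation parallelized over the rows of a sparse constraint matrix and is not the authors' GPU algorithm. All data are invented.
a = [1, 2, -1];  b = 4;                         % one constraint a'*x <= b (toy data)
lo = [0, 0, 0];  hi = [10, 10, 3];              % current variable bounds
minAct = sum(a(a>0) .* lo(a>0)) + sum(a(a<0) .* hi(a<0));   % minimal activity of the row
for j = 1:numel(a)
    if a(j) > 0
        rest = minAct - a(j)*lo(j);             % minimal activity of the other variables
        hi(j) = min(hi(j), (b - rest) / a(j));  % tighten the upper bound
    elseif a(j) < 0
        rest = minAct - a(j)*hi(j);
        lo(j) = max(lo(j), (b - rest) / a(j));  % tighten the lower bound
    end
end
[lo; hi]                                        % propagated bounds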
Solomonik E and Demmel J (2020), "Fast Bilinear Algorithms for Symmetric Tensor Contractions", Computational Methods in Applied Mathematics., 2, 2020. Walter de Gruyter GmbH.
Abstract: In matrix-vector multiplication, matrix symmetry does not permit a straightforward reduction in computational cost. More generally, in contractions of symmetric tensors, the symmetries are not preserved in the usual algebraic form of contraction algorithms. We introduce an algorithm that reduces the bilinear complexity (number of computed elementwise products) for most types of symmetric tensor contractions. In particular, it lowers the bilinear complexity of symmetrized contractions of symmetric tensors of order s+v and v+t by a factor of (s+t+v)!/(s!t!v!) to leading order. The algorithm computes a symmetric tensor of bilinear products, then subtracts unwanted parts of its partial sums. Special cases of this algorithm provide improvements to the bilinear complexity of the multiplication of a symmetric matrix and a vector, the symmetrized vector outer product, and the symmetrized product of symmetric matrices. While the algorithm requires more additions for each elementwise product, the total number of operations is in some cases less than classical algorithms, for tensors of any size. We provide a round-off error analysis of the algorithm and demonstrate that the error is not too large in practice. Finally, we provide an optimized implementation for one variant of the symmetry-preserving algorithm, which achieves speedups of up to 4.58× for a particular tensor contraction, relative to a classical approach that casts the problem as a matrix-matrix multiplication.
BibTeX:
@article{Solomonik2020,
  author = {Edgar Solomonik and James Demmel},
  title = {Fast Bilinear Algorithms for Symmetric Tensor Contractions},
  journal = {Computational Methods in Applied Mathematics},
  publisher = {Walter de Gruyter GmbH},
  year = {2020},
  doi = {10.1515/cmam-2019-0075}
}
Soltaniyeh M, Martin RP and Nagarakatte S (2020), "Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra", April, 2020.
Abstract: This paper describes REAP, a software-hardware approach that enables high performance sparse linear algebra computations on a cooperative CPU-FPGA platform. REAP carefully separates the task of organizing the matrix elements from the computation phase. It uses the CPU to provide a first-pass re-organization of the matrix elements, allowing the FPGA to focus on the computation. We introduce a new intermediate representation that allows the CPU to communicate the sparse data and the scheduling decisions to the FPGA. The computation is optimized on the FPGA for effective resource utilization with pipelining. REAP improves the performance of Sparse General Matrix Multiplication (SpGEMM) and Sparse Cholesky Factorization by 3.2X and 1.85X compared to widely used sparse libraries for them on the CPU, respectively.
BibTeX:
@article{Soltaniyeh2020,
  author = {Mohammadreza Soltaniyeh and Richard P. Martin and Santosh Nagarakatte},
  title = {Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra},
  year = {2020}
}
Soltan Mohammadi M (2020), "Automatic Sparse Computation Parallelization By Utilizing Domain-Specific Knowledge In Data Dependence Analysis". Thesis at: Department of Computer Science, The University of Arizona.
Abstract: Sparse vectors, matrices, and tensors are commonly used to compress the nonzero values of big data manipulated in data analytics, scientific simulations, and machine learning computations. As with general computations, parallelization of loops in sparse computations, i.e. codes manipulating sparse structures, is essential to efficiently utilize available parallel architectures. Sparse computations often exhibit partial parallelism in loops that are sequential in the corresponding dense computation, due to the sparsity of data dependencies coming from indirect memory accesses through index arrays, e.g. col in val[col[j]]. Such dependencies can only be discovered at runtime, when the contents of the index arrays are available. Consequently, performance programmers typically use the inspector/executor strategy to take advantage of partial parallelism in sparse computations. There, programmers implement an inspector code that creates an iteration dependency graph at runtime, from which wavefronts of iterations are extracted and fed into a parallel version of the computation called an executor. The executor executes the iteration waves sequentially to respect sparse dependencies while executing the iterations inside each wavefront in parallel. To automate the generation of the inspector and executor code, compiler-based loop-carried data dependency analysis is needed. However, straightforward automatically generated inspectors typically have significantly higher overhead than hand-written optimized ones. Consequently, the specific problem that I am addressing in this dissertation is how we can automate the strategies used by expert programmers to generate efficient runtime inspectors for parallelizing sparse computations. The overarching contribution of this dissertation is an approach for encoding index array properties for individual index arrays and relationships between index arrays as universally quantified constraints, and using them in compiler-based data dependence analysis. The dependence analysis is then evaluated in the context of finding wavefront parallelism in sparse computations. More specifically, one contribution is an approach to automatically use index array properties to prove more data dependencies unsatisfiable, removing the need for inspecting them at runtime. Other contributions are methods to use the same properties to simplify compile-time-satisfiable dependences by finding equalities and subset relationships, enabling the generation of faster runtime inspectors. The last contribution includes compile-time methods for expanding opportunities for array privatization in sparse computations by defining an array as private if its contents start and end each iteration with the same value. Evaluation results show my approach is able to find seven fully parallel loops in seven sparse computations where a previous compiler-based approach could not, and efficiently extract partial parallelism from the outermost loops of five out of six sparse computations.
BibTeX:
@phdthesis{SoltanMohammadi2020,
  author = {Soltan Mohammadi, Mahdi},
  title = {Automatic Sparse Computation Parallelization By Utilizing Domain-Specific Knowledge In Data Dependence Analysis},
  school = {Department of Computer Science, The University of Arizona},
  year = {2020}
}
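As a generic illustration of the inspector/executor pattern described above (plain level scheduling on a toy dependency chain, not the dissertation's compiler-generated code), the inspector assigns each iteration to a wavefront from runtime dependency data, and the executor runs the wavefronts in order:
n = 8;
dep = [0 0 1 2 2 3 5 6];              % dep(i) = earlier iteration that i depends on (0 = none); assumed runtime data
level = zeros(1, n);                  % inspector: wavefront number of each iteration
for i = 1:n
    if dep(i) == 0
        level(i) = 1;
    else
        level(i) = level(dep(i)) + 1; % one wave later than its dependency
    end
end
for w = 1:max(level)                  % executor: waves run sequentially ...
    wave = find(level == w);          % ... iterations within a wave are independent
    fprintf('wave %d: iterations %s\n', w, mat2str(wave));
end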
Song Y, Meng C, Liao R and Ermon S (2020), "Nonlinear Equation Solving: A Faster Alternative to Feedforward Computation", February, 2020.
Abstract: Feedforward computations, such as evaluating a neural network or sampling from an autoregressive model, are ubiquitous in machine learning. The sequential nature of feedforward computation, however, requires a strict order of execution and cannot be easily accelerated with parallel computing. To enable parallelization, we frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point iteration method, as well as hybrid methods of both. Crucially, Jacobi updates operate independently on each equation and can be executed in parallel. Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallel iterations. Experimentally, we demonstrate the effectiveness of our approach in accelerating 1) the evaluation of DenseNets on ImageNet and 2) autoregressive sampling of MADE and PixelCNN. We are able to achieve speedup factors between 1.2 and 33 under various conditions and computation models.
BibTeX:
@article{Song2020,
  author = {Yang Song and Chenlin Meng and Renjie Liao and Stefano Ermon},
  title = {Nonlinear Equation Solving: A Faster Alternative to Feedforward Computation},
  year = {2020}
}
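To illustrate the framing on a toy scalar recurrence (not the paper's DenseNet or PixelCNN experiments): a sequential computation x_k = f(x_{k-1}) can be viewed as a fixed-point system and solved with Jacobi sweeps that update all unknowns simultaneously; after at most n sweeps the Jacobi iterate matches the sequential result exactly. The step function below is an assumption for illustration.
n = 8;                                % number of sequential steps
f = @(x) tanh(0.9*x + 0.1);           % toy elementwise step function (assumed)
x0 = 0;
xseq = zeros(n, 1); prev = x0;        % sequential reference: x(k) = f(x(k-1))
for k = 1:n
    xseq(k) = f(prev); prev = xseq(k);
end
x = zeros(n, 1);                      % Jacobi fixed-point iteration
for sweep = 1:n                       % each sweep could update all entries in parallel
    x = f([x0; x(1:n-1)]);
end
max(abs(x - xseq))                    % 0: identical to the sequential computation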
Spiteri P (2020), "Parallel asynchronous algorithms: A survey", Advances in Engineering Software., 11, 2020. Vol. 149, pp. 102896. Elsevier BV.
Abstract: This paper deals with a synthetic presentation of parallel iterative asynchronous algorithms and their extensions for the solution of large sparse linear or pseudo-linear algebraic systems, possibly subject to constraints. The behavior of these parallel asynchronous iterative algorithms is studied by three distinct methods: the contraction property, the partial ordering property linked to the discrete maximum principle, and nested sets; the link between these three kinds of analysis is presented. Stopping tests for the iterations are presented both from a computer science and from a numerical analysis point of view, including in the latter case the approximate contraction property, the partial ordering property linked to the discrete maximum principle, and nested sets. The principle of implementation of these parallel asynchronous iterative methods is described for subdomain methods without and with overlapping; the use of a load balancing approach for asynchronous parallel algorithms is also discussed. Various applications modelled by linear or pseudo-linear equations and solved by such parallel algorithms are presented, as well as the use of these methods in computer security and Boolean calculation. The efficiency of parallel iterative asynchronous algorithms is also discussed.
BibTeX:
@article{Spiteri2020,
  author = {Pierre Spiteri},
  title = {Parallel asynchronous algorithms: A survey},
  journal = {Advances in Engineering Software},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {149},
  pages = {102896},
  doi = {10.1016/j.advengsoft.2020.102896}
}
Squires C, Amaniampong J and Uhler C (2020), "Efficient Permutation Discovery in Causal DAGs", November, 2020.
Abstract: The problem of learning a directed acyclic graph (DAG) up to Markov equivalence is equivalent to the problem of finding a permutation of the variables that induces the sparsest graph. Without additional assumptions, this task is known to be NP-hard. Building on the minimum degree algorithm for sparse Cholesky decomposition, but utilizing DAG-specific problem structure, we introduce an efficient algorithm for finding such sparse permutations. We show that on jointly Gaussian distributions, our method with depth w runs in O(p^{w+3}) time. We compare our method with w = 1 to algorithms for finding sparse elimination orderings of undirected graphs, and show that taking advantage of DAG-specific problem structure leads to a significant improvement in the discovered permutation. We also compare our algorithm to provably consistent causal structure learning algorithms, such as the PC algorithm, GES, and GSP, and show that our method achieves comparable performance with a shorter runtime. Thus, our method can be used on its own for causal structure discovery. Finally, we show that there exist dense graphs on which our method achieves almost perfect performance, so that unlike most existing causal structure learning algorithms, the situations in which our algorithm achieves both good performance and good runtime are not limited to sparse graphs.
BibTeX:
@article{Squires2020,
  author = {Chandler Squires and Joshua Amaniampong and Caroline Uhler},
  title = {Efficient Permutation Discovery in Causal DAGs},
  year = {2020}
}
Srinivasa RS, Davenport MA and Romberg J (2020), "Localized sketching for matrix multiplication and ridge regression", March, 2020.
Abstract: We consider sketched approximate matrix multiplication and ridge regression in the novel setting of localized sketching, where at any given point, only part of the data matrix is available. This corresponds to a block diagonal structure on the sketching matrix. We show that, under mild conditions, block diagonal sketching matrices require only O(s r^2) and O(s d_λ / ε) total sample complexity for matrix multiplication and ridge regression, respectively. This matches the state-of-the-art bounds that are obtained using global sketching matrices. The localized nature of sketching considered allows for different parts of the data matrix to be sketched independently and hence is more amenable to computation in distributed and streaming settings and results in a smaller memory and computational footprint.
BibTeX:
@article{Srinivasa2020,
  author = {Rakshith S Srinivasa and Mark A Davenport and Justin Romberg},
  title = {Localized sketching for matrix multiplication and ridge regression},
  year = {2020}
}
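A toy MATLAB sketch of the block-diagonal (localized) sketching idea for approximate matrix multiplication: each row partition of A and B is sketched with its own small Gaussian matrix, and A'*B is approximated by summing the products of the locally sketched blocks. The sizes and the Gaussian choice are assumptions for illustration, not the paper's setup.
m = 1000; n = 20; p = 15; s = 200;          % data rows, output sizes, sketch rows per block
A = randn(m, n);  B = randn(m, p);
blocks = {1:m/2, m/2+1:m};                  % two local partitions of the rows
approx = zeros(n, p);
for k = 1:numel(blocks)
    rows = blocks{k};
    Sk = randn(s, numel(rows)) / sqrt(s);   % local Gaussian sketch with E[Sk'*Sk] = I
    approx = approx + (Sk * A(rows, :))' * (Sk * B(rows, :));
end
norm(approx - A'*B, 'fro') / norm(A'*B, 'fro')   % modest relative error for this toy example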
Steinerberger S (2020), "A Spectral Approach to the Shortest Path Problem", April, 2020.
Abstract: Let G=(V,E) be a simple, connected graph. One is often interested in a short path between two vertices u,v. We propose a spectral algorithm: construct the function φ: V → ℝ_{≥0} as the minimizer φ = argmin_{f: V → ℝ, f(u) = 0, f ≢ 0} \sum_{(w_1, w_2) \in E} (f(w_1)-f(w_2))^2 / \sum_{w ∈ V} f(w)^2. φ can also be understood as the eigenvector corresponding to the smallest eigenvalue of the Laplacian matrix L=D-A after the u-th row and column have been removed. We start in the point v and construct a path from v to u: at each step, we move to the neighbor for which φ is the smallest. This algorithm provably terminates and results in a short path from v to u, often the shortest. The efficiency of this method is due to a discrete analogue of a phenomenon in Partial Differential Equations that is not well understood. We prove optimality for trees and discuss a number of open questions.
BibTeX:
@article{Steinerberger2020,
  author = {Stefan Steinerberger},
  title = {A Spectral Approach to the Shortest Path Problem},
  year = {2020}
}
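A small MATLAB sketch of the procedure as described in the abstract, on a toy 5-cycle graph (a direct transcription for illustration, not the author's code): remove the u-th row and column of the Laplacian, take the eigenvector of the smallest eigenvalue, and walk greedily from v toward smaller values of φ.
A = [0 1 0 0 1; 1 0 1 0 0; 0 1 0 1 0; 0 0 1 0 1; 1 0 0 1 0];   % adjacency of a 5-cycle
u = 1; v = 3;                                 % endpoints of the sought path
L = diag(sum(A, 2)) - A;                      % graph Laplacian
keep = setdiff(1:size(A, 1), u);              % delete the u-th row and column
[V, D] = eig(L(keep, keep));
[~, imin] = min(diag(D));
phi = zeros(size(A, 1), 1);
phi(keep) = abs(V(:, imin));                  % phi(u) stays 0, the global minimum
path = v; cur = v;
while cur ~= u
    nbrs = find(A(cur, :));
    [~, i] = min(phi(nbrs));                  % step to the neighbor with smallest phi
    cur = nbrs(i);
    path(end+1) = cur;                        %#ok<AGROW>
end
path                                          % e.g. 3 2 1 on this cycle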
Stonyakin F, Tyurin A, Gasnikov A, Dvurechensky P, Agafonov A, Dvinskikh D, Pasechnyuk D, Artamonov S and Piskunova V (2020), "Inexact Relative Smoothness and Strong Convexity for Optimization and Variational Inequalities by Inexact Model", January, 2020.
Abstract: In this paper we propose a general algorithmic framework for first-order methods in optimization in a broad sense, including minimization problems, saddle-point problems and variational inequalities. This framework allows one to obtain many known methods as special cases, the list including the accelerated gradient method, composite optimization methods, level-set methods, and Bregman proximal methods. The idea of the framework is based on constructing an inexact model of the main problem component, i.e. the objective function in optimization or the operator in variational inequalities. Besides reproducing known results, our framework allows us to construct new methods, which we illustrate by constructing a universal conditional gradient method and a universal method for variational inequalities with composite structure. These methods work for smooth and non-smooth problems with optimal complexity without a priori knowledge of the problem smoothness. As a particular case of our general framework, we introduce relative smoothness for operators and propose an algorithm for VIs with such an operator. We also generalize our framework for relatively strongly convex objectives and strongly monotone variational inequalities. This paper is an extended and updated version of [arXiv:1902.00990]. In particular, we add an extension of relative strong convexity for optimization and variational inequalities.
BibTeX:
@article{Stonyakin2020,
  author = {Fedor Stonyakin and Alexander Tyurin and Alexander Gasnikov and Pavel Dvurechensky and Artem Agafonov and Darina Dvinskikh and Dmitry Pasechnyuk and Sergei Artamonov and Victorya Piskunova},
  title = {Inexact Relative Smoothness and Strong Convexity for Optimization and Variational Inequalities by Inexact Model},
  year = {2020}
}
Su J, Zhang F, Liu W, He B, Wu R, Du X and Wang R (2020), "CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs", In Proceedings of the 49th International Conference on Parallel Processing., 8, 2020. ACM.
Abstract: Sparse triangular solves (SpTRSVs) have been extensively used in linear algebra fields, and many GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSVs, due to their short preprocessing time and high performance, are currently the most popular SpTRSV algorithms. However, we observe that the performance of those SpTRSV algorithms on different matrices can vary greatly, by up to 845 times. Our further studies show that when the average number of components per level is high and the average number of nonzero elements per row is low, those SpTRSVs exhibit extremely low performance. The reason is that they use a warp on the GPU to process a row in sparse matrices, and such warp-level designs lead to severe underutilization of the GPU. To solve this problem, we propose CapelliniSpTRSV, a thread-level synchronization-free SpTRSV algorithm. In particular, CapelliniSpTRSV has three novel features. First, unlike previous studies, CapelliniSpTRSV does not need preprocessing to calculate levels. Second, CapelliniSpTRSV exhibits high performance on matrices that previous SpTRSVs cannot handle efficiently. Third, CapelliniSpTRSV's optimization does not rely on a specific sparse matrix storage format. Instead, it achieves very good performance on the most popular sparse matrix storage format, compressed sparse row (CSR), and thus users do not need to conduct format conversion. We evaluate CapelliniSpTRSV with 245 matrices from the Florida Sparse Matrix Collection on three GPU platforms, and experiments show that our SpTRSV achieves 6.84 GFLOPS, which is a 4.97× speedup over the state-of-the-art synchronization-free SpTRSV algorithm, and a 4.74× speedup over the SpTRSV in cuSPARSE. CapelliniSpTRSV is open-sourced at https://github.com/JiyaSu/CapelliniSpTRSV.
BibTeX:
@inproceedings{Su2020,
  author = {Jiya Su and Feng Zhang and Weifeng Liu and Bingsheng He and Ruofan Wu and Xiaoyong Du and Rujia Wang},
  title = {CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs},
  booktitle = {Proceedings of the 49th International Conference on Parallel Processing},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3404397.3404400}
}
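For reference, the computation being parallelized is a sparse forward substitution. A plain sequential MATLAB version (row by row, nothing GPU-specific and not the CapelliniSpTRSV kernel) looks like this, with each row depending only on previously computed entries of x:
n = 6;
Lmat = sparse(tril(rand(n) > 0.5, -1) .* rand(n)) + speye(n);  % random sparse lower triangular with unit diagonal
b = rand(n, 1);
x = zeros(n, 1);
for i = 1:n
    x(i) = (b(i) - Lmat(i, 1:i-1) * x(1:i-1)) / Lmat(i, i);    % row i needs only earlier entries of x
end
norm(Lmat*x - b)                                               % ~1e-16: solves Lmat*x = b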
Sun J, Sun G, Zhan S, Zhang J and Chen Y (2020), "Automated Performance Modeling of HPC Applications Using Machine Learning", IEEE Transactions on Computers.
Abstract: Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as guiding task management and job scheduling, offering insights into application behaviors, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, so an accurate performance prediction is often a challenging task. In this study, we focus on automatically predicting the execution time of parallel programs with different inputs, at different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent runtime features. These features include values of variables and counters of branches, loops, and MPI communications. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given an input. An instance-transfer learning method is used to reuse an existing model and improve the prediction on a new platform that lacks historical execution data. Our experiments and analyses of three parallel applications on three different systems confirm that our method performs well, with less than 20% prediction error on average.
BibTeX:
@article{Sun2020,
  author = {J. Sun and G. Sun and S. Zhan and J. Zhang and Y. Chen},
  title = {Automated Performance Modeling of HPC Applications Using Machine Learning},
  journal = {IEEE Transactions on Computers},
  year = {2020},
  doi = {10.1109/TC.2020.2964767}
}
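A minimal sketch of the modeling step in MATLAB, on synthetic features and timings; TreeBagger from the Statistics and Machine Learning Toolbox stands in for the authors' random forest, and none of this reproduces their feature collection or transfer learning.
rng('default');
X = rand(200, 4);                                          % stand-ins for runtime features (loop/branch/MPI counters)
t = 2*X(:,1) + 0.5*X(:,2).*X(:,3) + 0.05*randn(200, 1);    % synthetic execution times
mdl = TreeBagger(100, X, t, 'Method', 'regression');       % random forest regressor
tPred = predict(mdl, rand(10, 4));                         % predicted times for unseen inputs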
Sundar K, Nagarajan H, Wang S, Linderoth J and Bent R (2020), "Piecewise Polyhedral Formulations for a Multilinear Term", January, 2020.
Abstract: In this paper, we present a mixed-integer linear programming formulation of a piecewise, polyhedral relaxation (PPR) of a multilinear term using its convex hull representation. Based on the solution of the PPR, we also present a MIP-based piecewise formulation which restricts the solutions to be feasible for the multilinear term. We then present computational results showing the effectiveness of the proposed formulations on instances from the standard Mixed-Integer Nonlinear Programming Library (MINLPLib) and compare the proposed formulation with a formulation that is built by recursively relaxing bilinear groupings of the multilinear term, as typically applied in the literature.
BibTeX:
@article{Sundar2020,
  author = {Kaarthik Sundar and Harsha Nagarajan and Site Wang and Jeff Linderoth and Russell Bent},
  title = {Piecewise Polyhedral Formulations for a Multilinear Term},
  year = {2020}
}
Swenson N, Krishnapriyan AS, Buluc A, Morozov D and Yelick K (2020), "PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction", October, 2020.
Abstract: Understanding protein structure-function relationships is a key challenge in computational biology, with applications across the biotechnology and pharmaceutical industries. While it is known that protein structure directly impacts protein function, many functional prediction tasks use only protein sequence. In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank in order to study the expressiveness of different structure-based prediction schemes. We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis to capture a complex set of both local and global structural features. While variations of these techniques have been successfully applied to proteins before, we demonstrate that our hybridized approach, PersGNN, outperforms either method on its own as well as a baseline neural network that learns from the same information. PersGNN achieves a 9.3% boost in area under the precision recall curve (AUPR) compared to the best individual model, as well as high F1 scores across different gene ontology categories, indicating the transferability of this approach.
BibTeX:
@article{Swenson2020,
  author = {Nicolas Swenson and Aditi S. Krishnapriyan and Aydin Buluc and Dmitriy Morozov and Katherine Yelick},
  title = {PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction},
  year = {2020}
}
Świrydowicz K, Langou J, Ananthan S, Yang U and Thomas S (2020), "Low synchronization Gram–Schmidt and generalized minimal residual algorithms", Numerical Linear Algebra with Applications., 10, 2020. Wiley.
Abstract: The Gram–Schmidt process uses orthogonal projection to construct the A = QR factorization of a matrix. When Q has linearly independent columns, the operator P = I - Q(Q^T Q)^{-1} Q^T defines an orthogonal projection onto the orthogonal complement of the range of Q. In finite precision, Q loses orthogonality as the factorization progresses. A family of approximate projections is derived with the form P = I - Q T Q^T, with correction matrix T. When T = (Q^T Q)^{-1} and T is triangular, it is postulated that the best achievable orthogonality is O(ε)κ(A). We present new variants of modified (MGS) and classical Gram–Schmidt algorithms that require one global reduction step. An interesting form of the projector leads to a compact WY representation for MGS. In particular, the inverse compact WY MGS algorithm is equivalent to a lower triangular solve. Our main contribution is to introduce a backward normalization lag into the compact WY representation, resulting in an O(ε)κ([r_0, A V_m]) stable Generalized Minimal Residual Method (GMRES) algorithm that requires only one global reduction per iteration. Further improvements in performance are achieved by accelerating GMRES on GPUs.
BibTeX:
@article{Swirydowicz2020,
  author = {Katarzyna Świrydowicz and Julien Langou and Shreyas Ananthan and Ulrike Yang and Stephen Thomas},
  title = {Low synchronization Gram–Schmidt and generalized minimal residual algorithms},
  journal = {Numerical Linear Algebra with Applications},
  publisher = {Wiley},
  year = {2020},
  doi = {10.1002/nla.2343}
}
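For context, here is textbook classical Gram–Schmidt written in the projection form above, where orthogonalizing each new column needs a single block inner product Q'*a (one global reduction in a distributed setting); this is only the baseline, not the paper's low-synchronization variants.
A = rand(50, 8);
[m, n] = size(A);
Q = zeros(m, n);  R = zeros(n, n);
for j = 1:n
    a = A(:, j);
    R(1:j-1, j) = Q(:, 1:j-1)' * a;          % one block reduction per column
    q = a - Q(:, 1:j-1) * R(1:j-1, j);       % apply P = I - Q*Q' to the new column
    R(j, j) = norm(q);
    Q(:, j) = q / R(j, j);
end
norm(Q'*Q - eye(n))                          % orthogonality loss grows with cond(A) in finite precision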
Tan C, Qian Y, Ma S and Zhang T (2020), "Accelerated Dual-Averaging Primal-Dual Method for Composite Convex Minimization", Optimization Methods and Software., January, 2020.
Abstract: Dual averaging-type methods are widely used in industrial machine learning applications due to their ability to promote solution structure (e.g., sparsity) efficiently. In this paper, we propose a novel accelerated dual-averaging primal-dual algorithm for minimizing a composite convex function. We also derive a stochastic version of the proposed method which solves empirical risk minimization, and its advantages in handling sparse data are demonstrated both theoretically and empirically.
BibTeX:
@article{Tan2020,
  author = {Conghui Tan and Yuqiu Qian and Shiqian Ma and Tong Zhang},
  title = {Accelerated Dual-Averaging Primal-Dual Method for Composite Convex Minimization},
  journal = {Optimization Methods and Software},
  year = {2020},
  doi = {10.1080/10556788.2020.1713779}
}
Tan C, Xie C, Marquez A, Tumeo A, Barker K and Li A (2020), "ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing", November, 2020.
Abstract: The next generation of HPC systems and data centers is likely to be reconfigurable and data-centric due to the trend of hardware specialization and the emergence of data-driven applications. In this paper, we propose ARENA -- an asynchronous reconfigurable accelerator ring architecture as a potential scenario for what future HPC systems and data centers may look like. Despite using coarse-grained reconfigurable arrays (CGRAs) as the substrate platform, our key contribution is not only the CGRA-cluster design itself, but also the ensemble of a new architecture and programming model that enables asynchronous tasking across a cluster of reconfigurable nodes, so as to bring specialized computation to the data rather than the reverse. We presume distributed data storage without asserting any prior knowledge of the data distribution. Hardware specialization occurs at runtime when a task finds that the majority of the data it requires is available at the present node. In other words, we dynamically generate specialized CGRA accelerators where the data reside. The asynchronous tasking for bringing computation to data is achieved by circulating the task token, which describes the data-flow graphs to be executed for a task, among the CGRA cluster connected by a fast ring network. Evaluations on a set of HPC and data-driven applications across different domains show that ARENA can provide better parallel scalability with reduced data movement (53.9%). Compared with contemporary compute-centric parallel models, ARENA can bring on average a 4.37× speedup. The synthesized CGRAs and their task-dispatchers occupy only 2.93 mm^2 of chip area under a 45 nm process technology and can run at 800 MHz with on average 759.8 mW power consumption. ARENA also supports the concurrent execution of multiple applications, offering ideal architectural support for future high-performance parallel computing and data analytics systems.
BibTeX:
@article{Tan2020a,
  author = {Cheng Tan and Chenhao Xie and Andres Marquez and Antonino Tumeo and Kevin Barker and Ang Li},
  title = {ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing},
  year = {2020}
}
Tanaka Y, Eldar YC, Ortega A and Cheung G (2020), "Sampling on Graphs: From Theory to Applications", March, 2020.
Abstract: The study of sampling signals on graphs, with the goal of building an analog of sampling for standard signals in the time and spatial domains, has attracted considerable attention recently. Beyond adding to the growing theory on graph signal processing (GSP), sampling on graphs has various promising applications. In this article, we review current progress on sampling over graphs, focusing on theory and potential applications. Most methodologies used in graph signal sampling are designed to parallel those used in sampling for standard signals; however, sampling theory for graph signals significantly differs from that for Shannon–Nyquist and shift-invariant signals. This is due in part to the fact that the definitions of several important properties, such as shift invariance and bandlimitedness, are different in GSP systems. Throughout, we discuss similarities and differences between standard and graph sampling and highlight open problems and challenges.
BibTeX:
@article{Tanaka2020,
  author = {Yuichi Tanaka and Yonina C. Eldar and Antonio Ortega and Gene Cheung},
  title = {Sampling on Graphs: From Theory to Applications},
  year = {2020}
}
Tandon S, Marincic I, Hoffmann H and Johnsen E (2020), "Enabling power-performance balance with transprecision calculations for extreme-scale computations of turbulent flows", In Proceedings of the 2020 AIAA Aviation Forum., 6, 2020. American Institute of Aeronautics and Astronautics.
Abstract: In modern scientific computing, the execution of floating-point operations emerges as a major contributor to the energy consumption of a compute-intensive application with a large dynamic range. Experimental evidence shows that over 50% of the energy consumed by a core and its data memory is related to floating-point computations. The adoption of floating-point formats requiring a smaller number of bits is an interesting opportunity to reduce energy consumption, as it allows simplification of the arithmetic circuitry and reduces the memory bandwidth required to transfer data between memory and registers. In theory, the adoption of multiple floating-point types following the principle of transprecision computing allows fine-grained control of floating-point arithmetic while meeting the desired standards on the accuracy of the final result. In this paper, the power-performance trade-offs for computing at different precision levels are analyzed for a parallel and distributed framework based on recovery-assisted discontinuous Galerkin (RADG) methods. The recovery operator of the RADG operates on compact support from neighboring elements and allows high-order approximation of the solution, with potential for massive parallelism. Using PoLiMEr – a power monitoring and management tool for HPC applications – fine-grained insights into the power characteristics of the RADG code on the supercomputer Theta at Argonne National Laboratory are presented. 3D benchmark tests indicate savings of approximately 5 W per node with single precision computing. A mixed precision approach where all computations except the recovery operation are performed in single precision shows promising results; however, an automated approach for tuning floating-point types and analyzing the floating-point sensitivity of variables and operations is desirable.
BibTeX:
@inproceedings{Tandon2020,
  author = {Suyash Tandon and Ivana Marincic and Henry Hoffmann and Eric Johnsen},
  title = {Enabling power-performance balance with transprecision calculations for extreme-scale computations of turbulent flows},
  booktitle = {Proceedings of the 2020 AIAA Aviation Forum},
  publisher = {American Institute of Aeronautics and Astronautics},
  year = {2020},
  doi = {10.2514/6.2020-2922}
}
Tang W and Daoutidis P (2020), "Fast and Stable Nonconvex Constrained Distributed Optimization: The ELLADA Algorithm", April, 2020.
Abstract: Distributed optimization, where the computations are performed in a localized and coordinated manner using multiple agents, is a promising approach for solving large-scale optimization problems, e.g., those arising in model predictive control (MPC) of large-scale plants. However, a distributed optimization algorithm that is computationally efficient, globally convergent, amenable to nonconvex constraints and general inter-subsystem interactions remains an open problem. In this paper, we combine three important modifications to the classical alternating direction method of multipliers (ADMM) for distributed optimization. Specifically, (i) an extra-layer architecture is adopted to accommodate nonconvexity and handle inequality constraints, (ii) equality-constrained nonlinear programming (NLP) problems are allowed to be solved approximately, and (iii) a modified Anderson acceleration is employed for reducing the number of iterations. Theoretical convergence towards stationary solutions and computational complexity of the proposed algorithm, named ELLADA, is established. Its application to distributed nonlinear MPC is also described and illustrated through a benchmark process system.
BibTeX:
@article{Tang2020,
  author = {Wentao Tang and Prodromos Daoutidis},
  title = {Fast and Stable Nonconvex Constrained Distributed Optimization: The ELLADA Algorithm},
  year = {2020}
}
Tang M (2020), "Performance Optimization for Sparse Matrix Factorization Algorithms on Hybrid Multicore Architectures". Thesis at: University of Florida.
Abstract: The use of sparse direct methods in computational science is ubiquitous. Direct methods can be used to find solutions to many numerical algebra applications, including sparse linear systems, sparse linear least squares, and eigenvalue problems; consequently they form the backbone of a broad spectrum of large scale applications. The use of sparse direct methods is extensive, with many of the relevant science and engineering application areas being pushed to run at ever higher scales. In this work we delve into the implementations of sparse direct methods, including the sparse Cholesky, QR, and LU factorizations. We examine a number of state-of-the-art libraries for sparse matrix factorizations and improve their performance by applying various optimizations. For the sparse Cholesky factorization we have implemented multithreading, pipelining, the multilevel subtree method, and batched factorization. For the sparse QR factorization we implemented pipelining and improved the arithmetic CUDA kernels. For the sparse LU factorization, we implemented a supernodal sparse LU solver that can utilize multiple GPUs, and supports multithreading, pipelining, and batched factorization.
BibTeX:
@phdthesis{Tang2020a,
  author = {Meng Tang},
  title = {Performance Optimization for Sparse Matrix Factorization Algorithms on Hybrid Multicore Architectures},
  school = {University of Florida},
  year = {2020},
  url = {https://search.proquest.com/openview/83c10f7445ac2e8d969bfec6fbe05a13/1?pq-origsite=gscholar&cbl=18750&diss=y}
}
Tao W, Pan Z, Wu G and Tao Q (2020), "The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth Optimization", IEEE Transactions on Neural Networks and Learning Systems.
Abstract: The extrapolation strategy raised by Nesterov, which can accelerate the convergence rate of gradient descent methods by orders of magnitude when dealing with smooth convex objectives, has led to tremendous success in training machine learning tasks. In this article, the convergence of the individual iterates of projected subgradient (PSG) methods for nonsmooth convex optimization problems is theoretically studied based on Nesterov's extrapolation, which we name individual convergence. We prove that Nesterov's extrapolation has the strength to make the individual convergence of PSG optimal for nonsmooth problems. In light of this consideration, a direct modification of the subgradient evaluation suffices to achieve optimal individual convergence for strongly convex problems, which can be regarded as an interesting step toward the open question about stochastic gradient descent (SGD) posed by Shamir. Furthermore, we give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings. Compared with other state-of-the-art nonsmooth methods, the derived algorithms can serve as an alternative to basic SGD, especially in coping with machine learning problems where an individual output is needed to guarantee the regularization structure while keeping an optimal rate of convergence. Typically, our method is applicable as an efficient tool for solving large-scale l1-regularized hinge-loss learning problems. Several comparison experiments demonstrate that our individual output not only achieves an optimal convergence rate but also guarantees better sparsity than the averaged solution.
BibTeX:
@article{Tao2020,
  author = {Wei Tao and Zhisong Pan and Gaowei Wu and Qing Tao},
  title = {The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth Optimization},
  journal = {IEEE Transactions on Neural Networks and Learning Systems},
  year = {2020}
}
Thayer S, Gopalakrishnan GL, Briggs I, Bentley M, Ahn DH, Laguna I and Lee GL (2020), "ArcherGear", In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., 2, 2020. ACM.
Abstract: There is growing uptake of shared memory parallelism in high performance computing, and this has increased the need for data race checking during the creation of new parallel codes or parallelizing existing sequential codes. While race checking concepts and implementations have been around for many concurrency models, including tasking models such as Cilk and PThreads (e.g., the Thread Sanitizer tool), practically usable race checkers for other APIs such as OpenMP have been lagging. For example, the OpenMP parallelization of an important library (namely Hypre) was initially unsuccessful due to inexplicable nondeterminism introduced when the code was optimized, and later root-caused to a race by the then recently developed OpenMP race checker Archer [2]. The open-source Archer now enjoys significant traction within several organizations.
BibTeX:
@inproceedings{Thayer2020,
  author = {Samuel Thayer and Ganesh L. Gopalakrishnan and Ian Briggs and Michael Bentley and Dong H. Ahn and Ignacio Laguna and Gregory L. Lee},
  title = {ArcherGear},
  booktitle = {Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3332466.3374504}
}
Thebelt A, Kronqvist J, Mistry M, Lee RM, Sudermann-Merx N and Misener R (2020), "ENTMOOT: A Framework for Optimization over Ensemble Tree Models", March, 2020.
Abstract: Gradient boosted trees and other regression tree models perform well in a wide range of real-world, industrial applications. These tree models (i) offer insight into important prediction features, (ii) effectively manage sparse data, and (iii) have excellent prediction capabilities. Despite their advantages, they are generally unpopular for decision-making tasks and black-box optimization, which is due to their difficult-to-optimize structure and the lack of a reliable uncertainty measure. ENTMOOT is our new framework for integrating (already trained) tree models into larger optimization problems. The contributions of ENTMOOT include: (i) explicitly introducing a reliable uncertainty measure that is compatible with tree models, (ii) solving the larger optimization problems that incorporate these uncertainty-aware tree models, (iii) proving that the solutions are globally optimal, i.e. no better solution exists. In particular, we show how the ENTMOOT approach allows a simple integration of tree models into decision-making and black-box optimization, where it proves to be a strong competitor to commonly-used frameworks.
BibTeX:
@article{Thebelt2020,
  author = {Alexander Thebelt and Jan Kronqvist and Miten Mistry and Robert M. Lee and Nathan Sudermann-Merx and Ruth Misener},
  title = {ENTMOOT: A Framework for Optimization over Ensemble Tree Models},
  year = {2020}
}
Tian Z, Zhang Y, Wang J and Gu C (2020), "Several relaxed iteration methods for computing PageRank", Journal of Computational and Applied Mathematics., November, 2020. Elsevier BV.
Abstract: In this paper, based on the iteration framework (Tian et al., 2019) and relaxed two-step splitting (RTSS) iteration method (Xie and Ma, 2018), we present two relaxed iteration methods for solving the PageRank problem, which are the relaxed generalized inner-outer (RGIO) and relaxed generalized two-step splitting (RGTSS) iteration methods, respectively. Next, their overall convergence properties are analyzed in detail, and choices of the parameters in these algorithms are also discussed. Finally, several numerical examples are given to illustrate the effectiveness of the proposed algorithms.
BibTeX:
@article{Tian2020,
  author = {Zhaolu Tian and Yan Zhang and Junxin Wang and Chuanqing Gu},
  title = {Several relaxed iteration methods for computing PageRank},
  journal = {Journal of Computational and Applied Mathematics},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.cam.2020.113295}
}
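For reference, the baseline that such splitting methods accelerate is the damped PageRank power iteration; a tiny MATLAB version on a toy graph (the textbook iteration, not the RGIO or RGTSS algorithms of the paper):
n = 5;  alpha = 0.85;
A = sparse([0 1 1 0 0; 0 0 1 0 0; 1 0 0 1 0; 0 0 0 0 1; 1 0 0 0 0]);  % adjacency, edge i -> j
P = spdiags(1 ./ full(sum(A, 2)), 0, n, n) * A;   % row-stochastic transition matrix (no dangling nodes here)
x = ones(n, 1) / n;
for it = 1:100
    x = alpha * (P' * x) + (1 - alpha) / n;       % x = alpha*P'*x + (1-alpha)*v with uniform teleportation v
end
x'                                                % PageRank vector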
Titolo L, Moscato M and Muñoz CA (2020), "Automatic generation and verification of test-stable floating-point code", January, 2020.
Abstract: Test instability in a floating-point program occurs when the control flow of the program diverges from its ideal execution assuming real arithmetic. This phenomenon is caused by the presence of round-off errors that affect the evaluation of arithmetic expressions occurring in conditional statements. Unstable tests may lead to significant errors in safety-critical applications that depend on numerical computations. Writing programs that take into consideration test instability is a difficult task that requires expertise on finite precision computations and rounding errors. This paper presents a toolchain to automatically generate and verify a provably correct test-stable floating-point program from a functional specification in real arithmetic. The input is a real-valued program written in the Prototype Verification System (PVS) specification language and the output is a transformed floating-point C program annotated with ANSI/ISO C Specification Language (ACSL) contracts. These contracts relate the floating-point program to its functional specification in real arithmetic. The transformed program detects if unstable tests may occur and, in these cases, issues a warning and terminates. An approach that combines the Frama-C analyzer, the PRECiSA round-off error estimator, and PVS is proposed to automatically verify that the generated program code is correct in the sense that, if the program terminates without a warning, it follows the same computational path as its real-valued functional specification.
BibTeX:
@article{Titolo2020,
  author = {Laura Titolo and Mariano Moscato and Cesar A. Muñoz},
  title = {Automatic generation and verification of test-stable floating-point code},
  year = {2020}
}
Titolo L, Moscato M, Feliu MA and Muñoz CA (2020), "Automatic Generation of Guard-Stable Floating-Point Code", In Lecture Notes in Computer Science. , pp. 141-159. Springer International Publishing.
Abstract: In floating-point programs, guard instability occurs when the control flow of a conditional statement diverges from its ideal execution under real arithmetic. This phenomenon is caused by the presence of round-off errors in floating-point computations. Writing programs that correctly handle guard instability often requires expertise on finite precision arithmetic. This paper presents a fully automatic toolchain that generates and formally verifies a guard-stable floating-point C program from its functional specification in real arithmetic. The generated program is instrumented to soundly detect when unstable guards may occur and, in these cases, to issue a warning. The proposed approach combines the PRECiSA floating-point static analyzer, the Frama-C software verification suite, and the PVS theorem prover.
BibTeX:
@incollection{Titolo2020a,
  author = {Laura Titolo and Mariano Moscato and Marco A. Feliu and César A. Muñoz},
  title = {Automatic Generation of Guard-Stable Floating-Point Code},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {141--159},
  doi = {10.1007/978-3-030-63461-2_8}
}
Tripathy A, Yelick K and Buluc A (2020), "Reducing Communication in Graph Neural Network Training", May, 2020.
Abstract: Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We present a family of parallel algorithms for training GNNs. These algorithms are based on their counterparts in dense and sparse linear algebra, but they had not been previously applied to GNN training. We show that they can asymptotically reduce communication compared to existing parallel GNN training methods. We implement a promising and practical version that is based on 2D sparse-dense matrix multiplication using torch.distributed. Our implementation parallelizes over GPU-equipped clusters. We train GNNs on up to a hundred GPUs on datasets that include a protein network with over a billion edges.
BibTeX:
@article{Tripathy2020,
  author = {Alok Tripathy and Katherine Yelick and Aydin Buluc},
  title = {Reducing Communication in Graph Neural Network Training},
  year = {2020}
}
Trotter JD, Langguth J and Cai X (2020), "Cache simulation for irregular memory traffic on multi-core CPUs: Case study on performance models for sparse matrix–vector multiplication", Journal of Parallel and Distributed Computing., 10, 2020. Vol. 144, pp. 189-205. Elsevier BV.
Abstract: Parallel computations with irregular memory access patterns are often limited by the memory subsystems of multi-core CPUs, though it can be difficult to pinpoint and quantify performance bottlenecks precisely. We present a method for estimating volumes of data traffic caused by irregular, parallel computations on multi-core CPUs with memory hierarchies containing both private and shared caches. Further, we describe a performance model based on these estimates that applies to bandwidth-limited computations. As a case study, we consider two standard algorithms for sparse matrix–vector multiplication, a widely used, irregular kernel. Using three different multi-core CPU systems and a set of matrices that induce a range of irregular memory access patterns, we demonstrate that our cache simulation combined with the proposed performance model accurately quantifies performance bottlenecks that would not be detected using standard best- or worst-case estimates of the data traffic volume.
BibTeX:
@article{Trotter2020,
  author = {James D. Trotter and Johannes Langguth and Xing Cai},
  title = {Cache simulation for irregular memory traffic on multi-core CPUs: Case study on performance models for sparse matrix–vector multiplication},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {144},
  pages = {189--205},
  doi = {10.1016/j.jpdc.2020.05.020}
}
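The irregular access pattern under study is the gather x(col(k)) inside a row-wise CSR sparse matrix-vector product. A plain MATLAB transcription follows (the study itself measures compiled, multi-threaded kernels; this only shows where the irregular memory traffic comes from):
n = 6;
A = sprand(n, n, 0.4) + speye(n);             % random sparse matrix
x = rand(n, 1);
[col, ~, val] = find(A');                     % CSR arrays of A, obtained via the column-major storage of A'
rowPtr = [0; cumsum(full(sum(A ~= 0, 2)))];   % entries rowPtr(i)+1 .. rowPtr(i+1) belong to row i
y = zeros(n, 1);
for i = 1:n
    for k = rowPtr(i)+1 : rowPtr(i+1)
        y(i) = y(i) + val(k) * x(col(k));     % irregular gather from x drives the cache traffic
    end
end
norm(y - A*x)                                 % ~1e-16: matches the built-in product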
Tsai YM, Cojean T and Anzt H (2020), "Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On", In Lecture Notes in Computer Science. , pp. 309-327. Springer International Publishing.
Abstract: Efficiently processing sparse matrices is a central and performance-critical part of many scientific simulation codes. Recognizing the adoption of manycore accelerators in HPC, we evaluate in this paper the performance of the currently best sparse matrix-vector product (SpMV) implementations on high-end GPUs from AMD and NVIDIA. Specifically, we optimize SpMV kernels for the CSR, COO, ELL, and HYB format taking the hardware characteristics of the latest GPU technologies into account. We compare for 2,800 test matrices the performance of our kernels against AMD's hipSPARSE library and NVIDIA's cuSPARSE library, and ultimately assess how the GPU technologies from AMD and NVIDIA compare in terms of SpMV performance.
BibTeX:
@incollection{Tsai2020,
  author = {Yuhsiang M. Tsai and Terry Cojean and Hartwig Anzt},
  title = {Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On},
  booktitle = {Lecture Notes in Computer Science},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {309--327},
  doi = {10.1007/978-3-030-50743-5_16}
}
Tsai YM, Cojean T, Ribizel T and Anzt H (2020), "Preparing Ginkgo for AMD GPUs -- A Testimonial on Porting CUDA Code to HIP", June, 2020.
Abstract: With AMD reinforcing their ambition in the scientific high performance computing ecosystem, we extend the hardware scope of the Ginkgo linear algebra package to feature a HIP backend for AMD GPUs. In this paper, we report and discuss the porting effort from CUDA, the extension of the HIP framework to add missing features such as cooperative groups, the performance price of compiling HIP code for AMD architectures, and the design of a library providing native backends for NVIDIA and AMD GPUs while minimizing code duplication by using a shared code base.
BibTeX:
@article{Tsai2020a,
  author = {Yuhsiang M. Tsai and Terry Cojean and Tobias Ribizel and Hartwig Anzt},
  title = {Preparing Ginkgo for AMD GPUs -- A Testimonial on Porting CUDA Code to HIP},
  year = {2020}
}
Tsai YM, Cojean T and Anzt H (2020), "Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse Linear Algebra Computations", August, 2020.
Abstract: GPU accelerators have become an important backbone for scientific high performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA's newest server line GPU, the A100 architecture part of the Ampere generation. Specifically, we assess its performance for sparse linear algebra operations that form the backbone of many scientific applications and assess the performance improvements over its predecessor.
BibTeX:
@article{Tsai2020b,
  author = {Yuhsiang Mike Tsai and Terry Cojean and Hartwig Anzt},
  title = {Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse Linear Algebra Computations},
  year = {2020}
}
Tupitsa N, Gasnikov A, Dvurechensky P and Guminov S (2020), "Strongly Convex Optimization for the Dual Formulation of Optimal Transport", In Mathematical Optimization Theory and Operations Research. , pp. 192-204. Springer International Publishing.
Abstract: In this paper we experimentally check the hypothesis that the dual problem to the discrete entropy-regularized optimal transport problem possesses strong convexity on a certain compact set. We present a numerical technique for estimating the strong convexity parameter and show that such an estimate increases the performance of an accelerated alternating minimization algorithm for strongly convex functions applied to the considered problem.
BibTeX:
@incollection{Tupitsa2020,
  author = {Nazarii Tupitsa and Alexander Gasnikov and Pavel Dvurechensky and Sergey Guminov},
  title = {Strongly Convex Optimization for the Dual Formulation of Optimal Transport},
  booktitle = {Mathematical Optimization Theory and Operations Research},
  publisher = {Springer International Publishing},
  year = {2020},
  pages = {192--204},
  doi = {10.1007/978-3-030-58657-7_17}
}
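For background, the primal entropy-regularized optimal transport problem is commonly solved with Sinkhorn iterations; the sketch below shows that baseline in MATLAB. The accelerated alternating minimization method and the strong convexity estimation studied in the paper are not reproduced here.
% Plain Sinkhorn iterations for entropy-regularized OT (background sketch only).
n = 50; m = 60;
p = rand(n,1); p = p/sum(p);
q = rand(m,1); q = q/sum(q);
C = rand(n, m);                 % ground cost
gamma = 0.05;                   % entropic regularization
K = exp(-C/gamma);
u = ones(n,1); v = ones(m,1);
for it = 1:500
    u = p ./ (K*v);
    v = q ./ (K'*u);
end
P = diag(u)*K*diag(v);          % transport plan; rows sum to ~p, columns to ~q
[norm(sum(P,2)-p), norm(sum(P,1)'-q)]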
Tutunov R, Li M, Wang J and Bou-Ammar H (2020), "Compositional ADAM: An Adaptive Compositional Solver", February, 2020.
Abstract: In this paper, we present C-ADAM, the first adaptive solver for compositional problems involving a non-linear functional nesting of expected values. We prove that C-ADAM converges to a stationary point in O(δ^-2.25), with δ being a precision parameter. Moreover, we demonstrate the importance of our results by bridging, for the first time, model-agnostic meta-learning (MAML) and compositional optimisation, showing the fastest known rates for deep network adaptation to date. Finally, we validate our findings in a set of experiments from portfolio optimisation and meta-learning. Our results manifest significant sample complexity reductions compared to both standard and compositional solvers.
BibTeX:
@article{Tutunov2020,
  author = {Rasul Tutunov and Minne Li and Jun Wang and Haitham Bou-Ammar},
  title = {Compositional ADAM: An Adaptive Compositional Solver},
  year = {2020}
}
Tzovas C, Predari M and Meyerhenke H (2020), "Distributing Sparse Matrix/Graph Applications in Heterogeneous Clusters -- an Experimental Study", November, 2020.
Abstract: Many problems in scientific and engineering applications contain sparse matrices or graphs as main input objects, e.g. numerical simulations on meshes. Large inputs are abundant these days and require parallel processing for memory size and speed. To optimize the execution of such simulations on cluster systems, the input problem needs to be distributed suitably onto the processing units (PUs). More and more frequently, such clusters contain different CPUs or a combination of CPUs and GPUs. This heterogeneity makes the load distribution problem quite challenging. Our study is motivated by the observation that established partitioning tools do not handle such heterogeneous distribution problems as well as homogeneous ones. In this paper, we first formulate the problem of balanced load distribution for heterogeneous architectures as a multi-objective, single-constraint optimization problem. We then split the problem into two phases and propose a greedy approach to determine optimal block sizes for each PU. These block sizes are then fed into numerous existing graph partitioners, for us to examine how well they handle the above problem. One of the tools we consider is an extension of our own previous work (von Looz et al, ICPP'18) called Geographer. Our experiments on well-known benchmark meshes indicate that only two tools under consideration are able to yield good quality. These two are Parmetis (both the geometric and the combinatorial variant) and Geographer. While Parmetis is faster, Geographer yields better quality on average.
BibTeX:
@article{Tzovas2020,
  author = {Charilaos Tzovas and Maria Predari and Henning Meyerhenke},
  title = {Distributing Sparse Matrix/Graph Applications in Heterogeneous Clusters -- an Experimental Study},
  year = {2020}
}
Uribe CA and Jadbabaie A (2020), "A Distributed Cubic-Regularized Newton Method for Smooth Convex Optimization over Networks", July, 2020.
Abstract: We propose a distributed, cubic-regularized Newton method for large-scale convex optimization over networks. The proposed method requires only local computations and communications and is suitable for federated learning applications over arbitrary network topologies. We show an O(k^-3) convergence rate when the cost function is convex with Lipschitz gradient and Hessian, with k being the number of iterations. We further provide network-dependent bounds for the communication required in each step of the algorithm. We provide numerical experiments that validate our theoretical results.
BibTeX:
@article{Uribe2020,
  author = {César A. Uribe and Ali Jadbabaie},
  title = {A Distributed Cubic-Regularized Newton Method for Smooth Convex Optimization over Networks},
  year = {2020}
}
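The local subproblem behind a cubic-regularized Newton step can be solved through a one-dimensional secular equation; the sketch below does this for a single, centralized step on a synthetic convex problem. The distributed scheme and the communication bounds of the paper are not reproduced, and the Hessian Lipschitz constant M is simply assumed known.
% One centralized cubic-regularized Newton step: minimize the model
%   m(d) = g'*d + 0.5*d'*H*d + (M/6)*norm(d)^3
% by bisection on r = norm(d), using d(r) = -(H + (M*r/2)*I) \ g.
n = 5;
B = randn(n); H = B'*B + eye(n);   % a convex (positive definite) Hessian
g = randn(n, 1);                   % gradient at the current iterate
M = 1;                             % Hessian Lipschitz constant (assumed known)
dOf = @(r) -(H + (M*r/2)*eye(n)) \ g;
phi = @(r) norm(dOf(r)) - r;       % phi is decreasing; its root gives ||d*||
lo = 0; hi = 1;
while phi(hi) > 0, hi = 2*hi; end
for it = 1:60                      % bisection
    mid = (lo + hi)/2;
    if phi(mid) > 0, lo = mid; else, hi = mid; end
end
d = dOf((lo + hi)/2);              % cubic-regularized Newton direction
norm(d)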
Uroić T and Jasak H (2020), "Parallelisation of selective algebraic multigrid for block--pressure--velocity system in OpenFOAM", Computer Physics Communications., 8, 2020. , pp. 107529. Elsevier BV.
Abstract: In the world of computational fluid dynamics (CFD), solving the governing equations of incompressible, turbulent, single-phase fluid flow still represents the basis of many industrial and academic applications. The implicitly coupled (monolithic) solution approach is still being developed and investigated for industrial-size applications. A parallel selection algebraic multigrid algorithm (AMG) based on the domain decomposition method is presented, applied for the solution of the linearised implicitly coupled pressure-velocity system discretised by the finite volume method, implemented in OpenFOAM. Since the setup phase of the selection AMG, i.e. sorting the equations into coarse and fine subsets, is inherently sequential, it was decided to perform the setup phase locally on each processing unit. The prolongation matrix for transferring the correction from coarse to fine level and the restriction matrix for transferring the residual from fine to coarse level are assembled locally as well. Parallel communication is necessary only for the calculation of the coarse level matrix, i.e. the matrix elements which describe the cross-coupling of equations located on different processing units. A localised version of the ILU factorisation based on Crout's algorithm is used as a smoother in the multigrid cycle. A detailed analysis of the coarse level matrix complexity is conducted in the context of the finite volume method in domain decomposition mode. The performance and scaling of our parallel implementation are investigated for two test cases, and the possible drawbacks of the method are given.
BibTeX:
@article{Uroic2020,
  author = {Tessa Uroić and Hrvoje Jasak},
  title = {Parallelisation of selective algebraic multigrid for block--pressure--velocity system in OpenFOAM},
  journal = {Computer Physics Communications},
  publisher = {Elsevier BV},
  year = {2020},
  pages = {107529},
  doi = {10.1016/j.cpc.2020.107529}
}
Valero-Lara P, Catalán S, Martorell X, Usui T and Labarta J (2020), "sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)", Journal of Parallel and Distributed Computing.
Abstract: In this work we have implemented a novel Linear Algebra Library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features, namely weak dependencies and regions together with the final clause, for the implementation of auto-tunable code for the BLAS-3 trsm routine and the LAPACK routines npgetrf and npgesv. All these implementations are part of the first prototype of the sLASs library, a novel library of auto-tunable codes for linear algebra operations based on the LASs library. In all these cases, the use of the OmpSs-2 features yields an improvement in execution time over other reference libraries such as the original LASs library, PLASMA, ATLAS and Intel MKL. These codes are able to reduce the execution time by about 18% on big matrices, by increasing the IPC of gemm and reducing the time of task instantiation. For a few medium matrices, benefits are also seen. For small matrices and a subset of medium matrices, specific optimizations that increase the degree of parallelism in both the gemm and trsm tasks are applied. This strategy achieves a performance improvement of up to 40%.
BibTeX:
@article{ValeroLara2020,
  author = {Pedro Valero-Lara and Sandra Catalán and Xavier Martorell and Tetsuzo Usui and Jesús Labarta},
  title = {sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)},
  journal = {Journal of Parallel and Distributed Computing},
  year = {2020},
  url = {http://www.sciencedirect.com/science/article/pii/S0743731519303417},
  doi = {10.1016/j.jpdc.2019.12.002}
}
Vanover J, Deng X and Rubio-González C (2020), "Discovering Discrepancies in Numerical Libraries", Proceedings of the International Symposium on Software Testing and Analysis. ACM.
Abstract: Numerical libraries constitute the building blocks for software applications that perform numerical calculations. Thus, it is paramount that such libraries provide accurate and consistent results. To that end, this paper addresses the problem of finding discrepancies between synonymous functions in different numerical libraries as a means of identifying incorrect behavior. Our approach automatically finds such synonymous functions, synthesizes testing drivers, and executes differential tests to discover meaningful discrepancies across numerical libraries. We implement our approach in a tool named FPDiff, and provide an evaluation on four popular numerical libraries: GNU Scientific Library (GSL), SciPy, mpmath, and jmat. FPDiff finds a total of 126 equivalence classes with a 95.8% precision and 79.0% recall, and discovers 655 instances in which an input produces a set of disagreeing outputs between function synonyms, 150 of which we found to represent 125 unique bugs. We have reported all bugs to library maintainers; so far, 30 bugs have been fixed, 9 have been found to be previously known, and 25 more have been acknowledged by developers.
BibTeX:
@article{Vanover2020,
  author = {Jackson Vanover and Xuan Deng and Cindy Rubio-González},
  title = {Discovering Discrepancies in Numerical Libraries},
  journal = {Proceedings of the International Symposium on Software Testing and Analysis},
  publisher = {ACM},
  year = {2020},
  url = {https://web.cs.ucdavis.edu/~rubio/includes/issta20.pdf}
}
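The flavor of differential testing between "synonymous" routines can be reproduced in a few lines of MATLAB by comparing two algebraically equivalent expressions on many inputs. FPDiff itself targets GSL, SciPy, mpmath and jmat and automates the discovery of such synonyms, which this sketch does not.
% Differential testing flavor: compare a dedicated routine against a naive synonym.
rng(0);
x = 10.^(-rand(1e5,1)*15);            % inputs spread over many magnitudes
f1 = expm1(x);                        % accurate routine
f2 = exp(x) - 1;                      % naive "synonym", loses digits for tiny x
relDiff = abs(f1 - f2) ./ max(abs(f1), realmin);
suspicious = find(relDiff > 1e-6);
fprintf('%d of %d inputs show a relative discrepancy above 1e-6\n', ...
        numel(suspicious), numel(x));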
Vatai E, Singhal U and Suda R (2020), "Diamond Matrix Powers Kernels", In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. New York, NY, USA , pp. 102-113. ACM.
Abstract: The matrix powers kernel calculates the vectors A^k v, for k = 1, 2, ..., m; these vectors are at the heart of various scientific computations, including communication-avoiding iterative solvers. In this paper we propose the diamond matrix powers kernel (DMPK), whose purpose is to apply the "diamond tiling" stencil algorithm to general matrices. It can also be considered as an extension of the PA1 and PA2 algorithms introduced by Demmel et al. Our approach enables us to control the balance between the amount of communication avoidance and the redundant computation inherently present in communication-avoiding algorithms. We present a proof-of-concept implementation of the algorithm using MPI routines. The experiments we performed show that control of the amount of computation and communication is achievable, and that with more thorough optimisations, DMPK is a promising alternative to existing MPK approaches.
BibTeX:
@inproceedings{Vatai2020,
  author = {Vatai, Emil and Singhal, Utsav and Suda, Reiji},
  title = {Diamond Matrix Powers Kernels},
  booktitle = {Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region},
  publisher = {ACM},
  year = {2020},
  pages = {102--113},
  url = {http://doi.acm.org/10.1145/3368474.3368494},
  doi = {10.1145/3368474.3368494}
}
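For reference, the naive, communication-oblivious matrix powers kernel that DMPK reorganizes is just repeated SpMV; a minimal MATLAB version follows. The diamond tiling and the MPI-level communication avoidance are not shown.
% Naive matrix powers kernel: V(:,k) = A^k * v for k = 1..m by repeated SpMV.
n = 2000; m = 8;
A = sprand(n, n, 5/n);
v = rand(n, 1);
V = zeros(n, m);
w = v;
for k = 1:m
    w = A*w;                 % w now holds A^k * v
    V(:, k) = w;
end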
Vlaski S and Sayed AH (2020), "Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization", March, 2020.
Abstract: Rapid advances in data collection and processing capabilities have allowed for the use of increasingly complex models that give rise to nonconvex optimization problems. These formulations, however, can be arbitrarily difficult to solve in general, in the sense that even simply verifying that a given point is a local minimum can be NP-hard [1]. Still, some relatively simple algorithms have been shown to lead to surprisingly good empirical results in many contexts of interest. Perhaps the most prominent example is the success of the backpropagation algorithm for training neural networks. Several recent works have pursued rigorous analytical justification for this phenomenon by studying the structure of the nonconvex optimization problems and establishing that simple algorithms, such as gradient descent and its variations, perform well in converging towards local minima and avoiding saddle-points. A key insight in these analyses is that gradient perturbations play a critical role in allowing local descent algorithms to efficiently distinguish desirable from undesirable stationary points and escape from the latter. In this article, we cover recent results on second-order guarantees for stochastic first-order optimization algorithms in centralized, federated, and decentralized architectures.
BibTeX:
@article{Vlaski2020,
  author = {Stefan Vlaski and Ali H. Sayed},
  title = {Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization},
  year = {2020}
}
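The role of gradient perturbations in escaping saddle points is easy to see on a toy function: the sketch below runs plain and perturbed gradient descent from a strict saddle of f(x) = (x1^2 - 1)^2 + x2^2. This only illustrates the phenomenon the article surveys, not any specific algorithm from it.
% Plain GD started at the saddle (0,0) stalls; a tiny perturbation escapes to (±1, 0).
gradf = @(x) [4*x(1)*(x(1)^2 - 1); 2*x(2)];
alpha = 0.05; T = 200; rng(1);
xPlain = [0; 0]; xPert = [0; 0];
for t = 1:T
    xPlain = xPlain - alpha*gradf(xPlain);                     % stays at the saddle
    xPert  = xPert  - alpha*(gradf(xPert) + 1e-3*randn(2,1));  % escapes
end
disp([xPlain xPert])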
Wang H, Wei Z, Yuan Y, Du X and Wen J-R (2020), "Exact Single-Source SimRank Computation on Large Graphs", April, 2020.
Abstract: SimRank is a popular measurement for evaluating the node-to-node similarities based on the graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, Power Method, is computationally infeasible on graphs with more than 10^6 nodes. Consequently, no existing work has evaluated the actual trade-offs between query time and accuracy on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes the exact single-source and top-k SimRank results on large graphs. With high probability, this algorithm produces ground truths with a rigorous theoretical guarantee. We conduct extensive experiments on real-world datasets to demonstrate the efficiency of ExactSim. The results show that ExactSim provides the ground truth for any single-source SimRank query with a precision up to 7 decimal places within a reasonable query time.
BibTeX:
@article{Wang2020,
  author = {Hanzhi Wang and Zhewei Wei and Ye Yuan and Xiaoyong Du and Ji-Rong Wen},
  title = {Exact Single-Source SimRank Computation on Large Graphs},
  year = {2020},
  doi = {10.1145/3318464.3389781}
}
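The Power Method baseline mentioned in the abstract has a compact matrix form; the MATLAB sketch below runs it on a tiny random digraph. ExactSim's single-source machinery for million-node graphs is, of course, not reproduced here.
% All-pairs SimRank by the naive Power Method on a small directed graph.
n = 8; rng(2);
A = double(rand(n) < 0.3);  A(1:n+1:end) = 0;   % random digraph, no self-loops
indeg = sum(A, 1);
W = A ./ max(indeg, 1);         % column-normalized adjacency (in-neighbour averaging)
c = 0.6;                        % decay factor
S = eye(n);
for it = 1:30
    S = c * (W' * S * W);
    S(1:n+1:end) = 1;           % s(a,a) = 1 by definition
end
% S now approximates the SimRank matrix.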
Wang C-L, Nie F, Wang R and Li X (2020), "Revisiting Fast Spectral Clustering with Anchor Graph", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)., 5, 2020. IEEE.
Abstract: Many anchor-graph-based spectral clustering methods have been proposed to accelerate spectral clustering for large-scale problems. In this paper, we revisit the popular large-scale spectral clustering method based on the anchor graph, which is equivalent to the spectral decomposition of a similarity matrix obtained using a second-order transition probability. However, due to the special structure of the bipartite graph, there is no stable distribution of the random walk process. The even-order transition probabilities capture only a one-sided view of the bipartite structure, breaking the independence of data points and leading to undesired artifacts for boundary samples. Therefore, we propose a Fast Spectral Clustering based on the Random Walk Laplacian (FRWL) method. The random walk Laplacian explicitly balances the popularity of anchors and the independence of data points, which preserves the structure of boundary samples. The experimental results demonstrate the efficiency and effectiveness of our method.
BibTeX:
@inproceedings{Wang2020a,
  author = {Cheng-Long Wang and Feiping Nie and Rong Wang and Xuelong Li},
  title = {Revisiting Fast Spectral Clustering with Anchor Graph},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/icassp40776.2020.9053271}
}
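As a point of reference, here is vanilla spectral clustering with the random-walk matrix D^{-1}W on a toy dataset. The anchor-graph construction and the FRWL method itself are not shown, and kmeans requires the Statistics and Machine Learning Toolbox.
% Vanilla random-walk spectral clustering on two Gaussian blobs (illustration only).
rng(3);
X = [randn(100,2); randn(100,2) + 6];                 % two well-separated blobs
sq = sum(X.^2, 2);
D2 = max(sq + sq' - 2*(X*X'), 0);                     % squared pairwise distances
W = exp(-D2 / 2);  W(1:size(W,1)+1:end) = 0;          % Gaussian affinities, sigma = 1
d = sum(W, 2);
Prw = W ./ d;                                         % random-walk matrix D^{-1}W
[V, ~] = eigs(Prw, 2, 'largestreal');                 % leading eigenvectors
labels = kmeans(real(V), 2);                          % needs the Statistics Toolbox
[sum(labels == 1), sum(labels == 2)]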
Wang L (2020), "Parallel Algorithms on Graph Matching". Thesis at: University of California Davis.
Abstract: Subgraph matching is a basic task in querying graph datasets. It can also be called subgraph isomorphism search, which consists of finding all embeddings of a small query graph in a large data graph. It is one of the key techniques for understanding the underlying structure of graph datasets. Graphs have been used to provide meaningful representations of objects and patterns, as well as more abstract descriptions. The representative power of graphs lies in their ability to characterize multiple pieces of information, as well as the relationships between them. Because of these properties, graph data structures have been leveraged in a wide spectrum of applications including social media, the World Wide Web, biological and genetic interactions, cyber networks, co-author networks, citations, etc. At the heart of graph theory is the problem of graph matching, which attempts to find a way to map one graph onto another in such a way that both the topological structure and the node and edge labels are matched. For domains where data is noisy, an identical match may not be possible, so an inexact graph matching algorithm is used to search for the closest match, minimizing some similarity function. There have been two completely different directions for supporting subgraph pattern matching. One direction is to develop specialized query processing engines, while the other is to develop efficient subgraph isomorphism algorithms for general, labeled graphs. Previously, both directions targeted distributed CPU systems, but the expensive network transfer overhead becomes a bottleneck. In order to exploit the efficiency and parallel capabilities of a single computer, we address the latter direction of subgraph matching. Most previous works on subgraph matching fall into three classes of approaches: depth-first tree search, constraint propagation and graph indexing, none of which are efficient on GPUs. Earlier attempts to run subgraph matching on GPUs target only a specific application and turn out to be memory-bound. My research intends to tackle this bottleneck and further make subgraph matching meet the needs of a broad spectrum of real-world applications.
BibTeX:
@phdthesis{Wang2020b,
  author = {Leyuan Wang},
  title = {Parallel Algorithms on Graph Matching},
  school = {University of California Davis},
  year = {2020}
}
Wang H, Keskar NS, Xiong C and Socher R (2020), "Assessing Local Generalization Capability in Deep Models", In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. Vol. 108
Abstract: While it has not yet been proven, empirical evidence suggests that model generalization is related to local properties of the optima, which can be described via the Hessian. We connect model generalization with the local property of a solution under the PAC-Bayes paradigm. In particular, we prove that model generalization ability is related to the Hessian, the higher-order “smoothness” terms characterized by the Lipschitz constant of the Hessian, and the scales of the parameters. Guided by the proof, we propose a metric to score the generalization capability of a model, as well as an algorithm that optimizes the perturbed model accordingly.
BibTeX:
@inproceedings{Wang2020c,
  author = {Huan Wang and Nitish Shirish Keskar and Caiming Xiong and Richard Socher},
  title = {Assessing Local Generalization Capability in Deep Models},
  booktitle = {Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics},
  year = {2020},
  volume = {108}
}
Wang M, Jia S, Chen E, Yang S, Liu P and Qi Z (2020), "A derived least square fast learning network model", Applied Intelligence., 7, 2020. Springer Science and Business Media LLC.
Abstract: The extreme learning machine (ELM) requires a large number of hidden layer nodes in the training process. Thus, random parameters will exponentially increase and affect network stability. Moreover, the single activation function affects the generalization capability of the network. This paper proposes a derived least square fast learning network (DLSFLN) to solve the aforementioned problems. DLSFLN uses the inheritance of some functions to obtain various activation functions through continuous differentiation of functions. The types of activation functions were increased and the mapping capability of hidden layer neurons was enhanced while the random parameter dimension was maintained. DLSFLN randomly generates the input weights and hidden layer thresholds and uses the least squares method to determine the connection weights between the output and the input layers and those between the output and the input nodes. The regression and classification experiments show that DLSFLN has a faster training speed and better training accuracy, generalization capability, and stability compared with other neural network algorithms, such as the fast learning network (FLN).
BibTeX:
@article{Wang2020d,
  author = {Meiqi Wang and Sixian Jia and Enli Chen and Shaopu Yang and Pengfei Liu and Zhuang Qi},
  title = {A derived least square fast learning network model},
  journal = {Applied Intelligence},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s10489-020-01773-6}
}
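The baseline that DLSFLN extends, an extreme learning machine with random hidden weights and least-squares output weights, fits in a few MATLAB lines. The derived activation functions and the additional input-to-output links of DLSFLN are not reproduced here.
% Minimal extreme learning machine: random hidden layer, least-squares readout.
rng(4);
X = rand(500, 3);                          % inputs (rows = samples)
T = sin(2*pi*X(:,1)) + X(:,2).^2 - X(:,3); % regression targets
L = 50;                                    % hidden nodes
Win = randn(3, L);  b = randn(1, L);       % random input weights and biases
H = 1 ./ (1 + exp(-(X*Win + b)));          % sigmoid hidden-layer outputs
beta = pinv(H) * T;                        % least-squares output weights
Tpred = H * beta;
rmse = sqrt(mean((Tpred - T).^2))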
Wang J and Xia Y (2020), "Closing the Gap between Necessary and Sufficient Conditions for Local Nonglobal Minimizer of Trust Region Subproblem", SIAM Journal on Optimization., 1, 2020. Vol. 30(3), pp. 1980-1995. Society for Industrial & Applied Mathematics (SIAM).
Abstract: The trust region subproblem has at most one local nonglobal minimizer. In characterizing this local solution, there is a clear gap between necessary and sufficient conditions. In this paper, we surprisingly show that the sufficient second-order optimality condition remains necessary. As an application, we improve the state-of-the-art algorithm for computing a candidate of the local nonglobal minimizer and then show that finding the local nonglobal minimizer or proving the nonexistence can be done in polynomial time.
BibTeX:
@article{Wang2020e,
  author = {Jiulin Wang and Yong Xia},
  title = {Closing the Gap between Necessary and Sufficient Conditions for Local Nonglobal Minimizer of Trust Region Subproblem},
  journal = {SIAM Journal on Optimization},
  publisher = {Society for Industrial & Applied Mathematics (SIAM)},
  year = {2020},
  volume = {30},
  number = {3},
  pages = {1980--1995},
  doi = {10.1137/19m1294459}
}
Wei X, Yue H, Gao S, Li L, Zhang R and Tan J (2020), "G-SEAP: Analyzing and characterizing soft-error aware approximation in GPGPUs", Future Generation Computer Systems., 3, 2020. Elsevier BV.
Abstract: As General-Purpose Graphics Processing Units (GPGPUs) become pervasive for High-Performance Computing (HPC), ensuring that programs can be protected from soft errors has become increasingly important. Soft errors may cause Silent Data Corruptions (SDCs), which produce erroneous execution results silently. Due to the massive parallelism of GPGPUs, fully protecting them against soft errors introduces nontrivial overhead. Fortunately, imprecise execution outcomes are inherently tolerable for some HPC programs due to the nature of these applications. Leveraging this feature, selective soft error protection can be applied to reduce energy consumption. In this work, we first propose a GPGPU-based Soft-Error aware APproximation analysis framework (G-SEAP) to characterize the approximation characteristics of soft errors. Based on G-SEAP, we perform an exhaustive analysis of 17 representative HPC benchmarks and observe that 72.7% of SDCs on average are approximable. We also observe that the dataflow of the application, the kernel function reliability requirement, the instruction type, and the data bit location are all important factors for a program's correctness. Lastly, based on these observations, we further design an approximate Error Correction Codes (ECCs) mechanism and an approximate instruction duplication technique to illustrate how G-SEAP provides useful guidance for energy-efficient soft-error elimination in GPGPUs.
BibTeX:
@article{Wei2020,
  author = {Xiaohui Wei and Hengshan Yue and Shang Gao and Lina Li and Ruyu Zhang and Jingweijia Tan},
  title = {G-SEAP: Analyzing and characterizing soft-error aware approximation in GPGPUs},
  journal = {Future Generation Computer Systems},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.future.2020.03.040}
}
Wen J, Zhang X, Gao H, Yuan J and Fang Y (2020), "EffMoP: Efficient Motion Planning Based on Heuristic-Guided Motion Primitives Pruning and Path Optimization With Sparse-Banded Structure", December, 2020.
Abstract: To solve the autonomous navigation problem in complex environments, an efficient motion planning approach called EffMoP is presented in this paper. Considering the challenges from large-scale, partially unknown complex environments, a three-layer motion planning framework is elaborately designed, including global path planning, local path optimization, and time-optimal velocity planning. Compared with existing approaches, the novelty of this work is twofold: 1) a heuristic-guided pruning strategy of motion primitives is newly designed and fully integrated into the search-based global path planner to improve the computational efficiency of graph search, and 2) a novel soft-constrained local path optimization approach is proposed, wherein the sparse-banded system structure of the underlying optimization problem is fully exploited to efficiently solve the problem. We validate the safety, smoothness, flexibility, and efficiency of EffMoP in various complex simulation scenarios and challenging real-world tasks. It is shown that the computational efficiency is improved by 66.21% in the global planning stage and the motion efficiency of the robot is improved by 22.87% compared with the recent quintic Bézier curve-based state space sampling approach.
BibTeX:
@article{Wen2020,
  author = {Jian Wen and Xuebo Zhang and Haiming Gao and Jing Yuan and Yongchun Fang},
  title = {EffMoP: Efficient Motion Planning Based on Heuristic-Guided Motion Primitives Pruning and Path Optimization With Sparse-Banded Structure},
  year = {2020}
}
Wiebe J, Cecílio I, Dunlop J and Misener R (2020), "A robust approach to warped Gaussian process-constrained optimization", June, 2020.
Abstract: Optimization problems with uncertain black-box constraints, modeled by warped Gaussian processes, have recently been considered in the Bayesian optimization setting. This work introduces a new class of constraints in which the same black-box function occurs multiple times evaluated at different domain points. Such constraints are important in applications where, e.g., safety-critical measures are aggregated over multiple time periods. Our approach, which uses robust optimization, reformulates these uncertain constraints into deterministic constraints guaranteed to be satisfied with a specified probability, i.e., deterministic approximations to a chance constraint. This approach extends robust optimization methods from parametric uncertainty to uncertain functions modeled by warped Gaussian processes. We analyze convexity conditions and propose a custom global optimization strategy for non-convex cases. A case study derived from production planning and an industrially relevant example from oil well drilling show that the approach effectively mitigates uncertainty in the learned curves. For the drill scheduling example, we develop a custom strategy for globally optimizing integer decisions.
BibTeX:
@article{Wiebe2020,
  author = {Johannes Wiebe and Inês Cecílio and Jonathan Dunlop and Ruth Misener},
  title = {A robust approach to warped Gaussian process-constrained optimization},
  year = {2020}
}
Wilkinson L and Luo H (2020), "A Distance-preserving Matrix Sketch", September, 2020.
Abstract: Visualizing very large matrices involves many formidable problems. Various popular solutions to these problems involve sampling, clustering, projection, or feature selection to reduce the size and complexity of the original task. An important aspect of these methods is how to preserve relative distances between points in the higher-dimensional space after reducing rows and columns to fit in a lower dimensional space. This aspect is important because conclusions based on faulty visual reasoning can be harmful. Judging dissimilar points as similar or similar points as dissimilar on the basis of a visualization can lead to false conclusions. To ameliorate this bias and to make visualizations of very large datasets feasible, we introduce a new algorithm that selects a subset of rows and columns of a rectangular matrix. This selection is designed to preserve relative distances as closely as possible. We compare our matrix sketch to more traditional alternatives on a variety of artificial and real datasets.
BibTeX:
@article{Wilkinson2020,
  author = {Leland Wilkinson and Hengrui Luo},
  title = {A Distance-preserving Matrix Sketch},
  year = {2020}
}
Wongpanich A, You Y and Demmel J (2020), "Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning", In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. New York, NY, USA , pp. 52-60. ACM.
Abstract: In recent years, the field of machine learning has seen significant advances as data becomes more abundant and deep learning models become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training time. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require the use of distributed systems or clusters in order to speed up training. Currently, there exist both asynchronous and synchronous solvers for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers, then conduct a comparison between them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for various applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For asynchronous algorithms, we implement an algorithm called EA-wild, which combines the idea of non-locking wild updates from Hogwild! [19] with EASGD. Our implementation is able to scale to 217,600 cores and finish 90 epochs of training the ResNet-50 model on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website 1.
BibTeX:
@inproceedings{Wongpanich2020,
  author = {Wongpanich, Arissa and You, Yang and Demmel, James},
  title = {Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning},
  booktitle = {Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region},
  publisher = {ACM},
  year = {2020},
  pages = {52--60},
  url = {http://doi.acm.org/10.1145/3368474.3368498},
  doi = {10.1145/3368474.3368498}
}
Wongpanich A, Pham H, Demmel J, Tan M, Le Q, You Y and Kumar S (2020), "Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour", October, 2020.
Abstract: EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks. Currently, EfficientNets can take on the order of days to train; for example, training an EfficientNet-B0 model takes 23 hours on a Cloud TPU v2-8 node. In this paper, we explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by speedups that can be achieved when training at such scales. We discuss optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules as well as utilizing distributed evaluation and batch normalization techniques. Additionally, we present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset in order to analyze the behavior of EfficientNets at scale. With our optimizations, we are able to train EfficientNet on ImageNet to an accuracy of 83% in 1 hour and 4 minutes.
BibTeX:
@article{Wongpanich2020a,
  author = {Arissa Wongpanich and Hieu Pham and James Demmel and Mingxing Tan and Quoc Le and Yang You and Sameer Kumar},
  title = {Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour},
  year = {2020}
}
Xia Y, Guo S, Hao J, Liu D and Xu J (2020), "Error detection of arithmetic expressions", The Journal of Supercomputing., 11, 2020. Springer Science and Business Media LLC.
Abstract: Inspecting floating-point errors is essential to floating-point operations. In this paper, we present the floating-point error detector (FPED), an inspector of floating-point errors for arithmetic expressions. FPED can pick a suitable benchmark generation approach by analyzing the distribution of the expression of a floating-point operation, thereby minimizing the possibility of underreporting floating-point errors. FPED is also able to determine the significant sources of errors in a floating-point operation according to the frequencies of the computation building blocks that contribute most to the floating-point errors, benefiting follow-up optimizations of computation accuracy. We validate the correctness and functionality of FPED by conducting experiments on the FPBench benchmark suite. The experimental results demonstrate that FPED obtains more accurate detection results than the random detecting approach with respect to floating-point error detection. We also compare FPED with existing dynamic error detection tools. The experimental results show that in most of the 33 test benchmarks, the maximum error results of FPED are greater than those of Herbgrind, and its detection performance is higher than that of Herbgrind.
BibTeX:
@article{Xia2020,
  author = {Yuanyuan Xia and Shaozhong Guo and Jiangwei Hao and Dan Liu and Jinchen Xu},
  title = {Error detection of arithmetic expressions},
  journal = {The Journal of Supercomputing},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s11227-020-03469-7}
}
Xie C, Chen J, Firoz JS, Li J, Song SL, Barker K, Raugas M and Li A (2020), "Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures", December, 2020.
Abstract: Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for Sparse Triangular Solver (SpTRSV) which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, thus warrant for efficient memory allocation, data partitioning, and workload distribution as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite linking via fast interconnect like NVLinks and NVSwitches. Alternatively, we employ the latest NVSHMEM technology based on Partitioned Global Address Space programming model to enable efficient fine-grained communication and drastic synchronization overhead reduction. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance the utilization of GPUs. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve on average 3.53x (up to 9.86x) speedup on a DGX-1 system and 3.66x (up to 9.64x) speedup on a DGX-2 system with 4-GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
BibTeX:
@article{Xie2020,
  author = {Chenhao Xie and Jieyang Chen and Jesun S Firoz and Jiajia Li and Shuaiwen Leon Song and Kevin Barker and Mark Raugas and Ang Li},
  title = {Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures},
  year = {2020}
}
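The dependency structure that makes SpTRSV hard to parallelize is often exposed with level scheduling: rows in the same level are mutually independent and can be solved together. The sketch below computes levels and solves a random sparse lower-triangular system level by level in MATLAB; the multi-GPU NVSHMEM design of the paper is not reproduced.
% Level-set view of L*y = b: rows within a level could be solved in parallel.
n = 1000;
L = tril(sprand(n, n, 3/n), -1) + speye(n);   % sparse lower triangular, unit diagonal
b = rand(n, 1);
lev = zeros(n, 1);
for i = 1:n
    deps = find(L(i, 1:i-1));                 % rows that row i depends on
    if isempty(deps), lev(i) = 1; else, lev(i) = 1 + max(lev(deps)); end
end
y = zeros(n, 1);
for l = 1:max(lev)
    rows = find(lev == l);                    % independent within a level
    for i = rows'
        y(i) = (b(i) - L(i, 1:i-1)*y(1:i-1)) / L(i, i);
    end
end
norm(L*y - b)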
Xin R, Pu S, Nedić A and Khan UA (2020), "A general framework for decentralized optimization with first-order methods", September, 2020.
Abstract: Decentralized optimization to minimize a finite sum of functions over a network of nodes has been a significant focus within control and signal processing research due to its natural relevance to optimal control and signal estimation problems. More recently, the emergence of sophisticated computing and large-scale data science needs have led to a resurgence of activity in this area. In this article, we discuss decentralized first-order gradient methods, which have found tremendous success in control, signal processing, and machine learning problems, where such methods, due to their simplicity, serve as the first method of choice for many complex inference and training tasks. In particular, we provide a general framework of decentralized first-order methods that is applicable to undirected and directed communication networks alike, and show that much of the existing work on optimization and consensus can be related explicitly to this framework. We further extend the discussion to decentralized stochastic first-order methods that rely on stochastic gradients at each node and describe how local variance reduction schemes, previously shown to have promise in the centralized settings, are able to improve the performance of decentralized methods when combined with what is known as gradient tracking. We motivate and demonstrate the effectiveness of the corresponding methods in the context of machine learning and signal processing problems that arise in decentralized environments.
BibTeX:
@article{Xin2020,
  author = {Ran Xin and Shi Pu and Angelia Nedić and Usman A. Khan},
  title = {A general framework for decentralized optimization with first-order methods},
  year = {2020}
}
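One representative method from this framework is decentralized gradient descent with gradient tracking; the sketch below runs it on quadratic local costs over a ring of nodes. The mixing weights and step size are illustrative choices of mine, not taken from the article.
% Decentralized gradient tracking for (1/n)*sum_i 0.5*||A_i*x - b_i||^2 on a ring.
rng(5);
n = 10; d = 4;
A = cell(n,1); b = cell(n,1);
for i = 1:n, A{i} = randn(20, d); b{i} = randn(20, 1); end
W = zeros(n);                                  % doubly stochastic ring weights
for i = 1:n
    W(i, i) = 0.5;
    W(i, mod(i, n) + 1)     = 0.25;
    W(i, mod(i - 2, n) + 1) = 0.25;
end
gradi = @(i, x) A{i}' * (A{i}*x - b{i});
X = zeros(d, n); G = zeros(d, n);
for i = 1:n, G(:, i) = gradi(i, X(:, i)); end
Y = G;                                         % trackers start at local gradients
alpha = 1e-2;
for t = 1:500
    Xnew = X*W' - alpha*Y;                     % consensus step plus tracked gradient
    Gnew = zeros(d, n);
    for i = 1:n, Gnew(:, i) = gradi(i, Xnew(:, i)); end
    Y = Y*W' + Gnew - G;                       % gradient tracking update
    X = Xnew; G = Gnew;
end
% All columns of X should now agree on the global least-squares solution.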
Xin R, Khan UA and Kar S (2020), "A fast randomized incremental gradient method for decentralized non-convex optimization", November, 2020.
Abstract: We study decentralized non-convex finite-sum minimization problems described over a network of nodes, where each node possesses a local batch of data samples. We propose a single-timescale first-order randomized incremental gradient method, termed GT-SAGA. GT-SAGA is computationally efficient since it evaluates only one component gradient per node per iteration and achieves provably fast and robust performance by leveraging node-level variance reduction and network-level gradient tracking. For general smooth non-convex problems, we show almost sure and mean-squared convergence to a first-order stationary point and describe regimes of practical significance where GT-SAGA achieves a network-independent convergence rate and outperforms the existing approaches respectively. When the global cost function further satisfies the Polyak-Łojasiewicz condition, we show that GT-SAGA exhibits global linear convergence to an optimal solution in expectation and describe regimes of practical interest where the performance is network-independent and improves upon existing work. Numerical experiments based on real-world datasets are included to highlight the behavior and convergence aspects of the proposed method.
BibTeX:
@article{Xin2020a,
  author = {Ran Xin and Usman A. Khan and Soummya Kar},
  title = {A fast randomized incremental gradient method for decentralized non-convex optimization},
  year = {2020}
}
Yang D, Liu J and Lai J (2020), "EDGES: An Efficient Distributed Graph Embedding System on GPU clusters", IEEE Transactions on Parallel and Distributed Systems. Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Graph embedding training models access parameters sparsely in a "one-hot" manner. Currently, the distributed graph embedding neural network is learned by data parallelism with a parameter server, which suffers significant performance and scalability problems. In this paper, we analyze the problems and characteristics of training this kind of model on distributed GPU clusters for the first time, and find that fixed model parameters scattered among different machine nodes are a major limiting factor for efficiency. Based on our observation, we develop an efficient distributed graph embedding system called EDGES, which can utilize GPU clusters to train large graph models with billions of nodes and trillions of edges using data and model parallelism. Within the system, we propose a novel dynamic partition architecture for training these models, achieving at least a halving of communication compared to existing training systems. According to our evaluations on real-world networks, our system delivers competitive accuracy for the trained embeddings and significantly accelerates the training process of the graph node embedding neural network, achieving speedups of 7.23× and 18.6× over the existing fastest training system on a single node and on multiple nodes, respectively. As for scalability, our experiments show that EDGES obtains a nearly linear speedup.
BibTeX:
@article{Yang2020,
  author = {Dongxu Yang and Junhong Liu and Junjie Lai},
  title = {EDGES: An Efficient Distributed Graph Embedding System on GPU clusters},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  year = {2020},
  doi = {10.1109/tpds.2020.3041219}
}
Yaşar A, Balin MF, An X, Sancak K and Çatalyürek Ü (2020), "On Symmetric Rectilinear Matrix Partitioning", September, 2020.
Abstract: Even distribution of irregular workload to processing units is crucial for efficient parallelization in many applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices. More specifically, we address the problem of symmetric rectilinear partitioning of a square matrix. By symmetric, we mean that the rows and columns of the matrix are identically partitioned, yielding a tiling where the diagonal tiles (blocks) are squares. We first show that this problem is NP-hard, and we propose four heuristics to solve two different variants of it. We present a thorough analysis of the computational complexities of the proposed heuristics. To make the proposed techniques more applicable in real-life application scenarios, we further reduce their computational complexities by utilizing effective sparsification strategies together with an efficient sparse prefix-sum data structure. We experimentally show that the proposed algorithms are efficient and effective on more than six hundred test matrices. With sparsification, our methods take less than 3 seconds on the Twitter graph on a modern 24-core system and output a solution whose load imbalance is no worse than 1%.
BibTeX:
@article{Yasar2020,
  author = {Abdurrahman Yaşar and Muhammed Fatih Balin and Xiaojing An and Kaan Sancak and Ümit V. Çatalyürek},
  title = {On Symmetric Rectilinear Matrix Partitioning},
  year = {2020}
}
Yaşar A, Rajamanickam S, Berry J and Çatalyürek ÜV (2020), "A Block-Based Triangle Counting Algorithm on Heterogeneous Environments", September, 2020.
Abstract: Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.
BibTeX:
@article{Yasar2020a,
  author = {Abdurrahman Yaşar and Sivasankaran Rajamanickam and Jonathan Berry and Ümit V. Çatalyürek},
  title = {A Block-Based Triangle Counting Algorithm on Heterogeneous Environments},
  year = {2020}
}
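As a baseline for what is being counted, the sketch below counts triangles of an undirected simple graph directly from its adjacency matrix in MATLAB. The blocked task decomposition for heterogeneous CPU and GPU execution is the paper's contribution and is not shown.
% Baseline triangle count: each triangle is counted 6 times in sum((A*A).*A).
n = 5000; rng(6);
A = sprand(n, n, 2e-3) > 0;
A = triu(A, 1); A = double(A | A');            % symmetric adjacency, zero diagonal
nTriangles = full(sum(sum((A*A) .* A))) / 6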
Yelick K, Buluç A, Awan M, Azad A, Brock B, Egan R, Ekanayake S, Ellis M, Georganas E, Guidi G, Hofmeyr S, Selvitopi O, Teodoropol C and Oliker L (2020), "The parallelism motifs of genomic data analysis", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences., 1, 2020. Vol. 378(2166), pp. 20190394. The Royal Society.
Abstract: Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs' that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
BibTeX:
@article{Yelick2020,
  author = {Katherine Yelick and Aydın Buluç and Muaaz Awan and Ariful Azad and Benjamin Brock and Rob Egan and Saliya Ekanayake and Marquita Ellis and Evangelos Georganas and Giulia Guidi and Steven Hofmeyr and Oguz Selvitopi and Cristina Teodoropol and Leonid Oliker},
  title = {The parallelism motifs of genomic data analysis},
  journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  publisher = {The Royal Society},
  year = {2020},
  volume = {378},
  number = {2166},
  pages = {20190394},
  doi = {10.1098/rsta.2019.0394}
}
Yi X, Zhang S, Yang T, Chai T and Johansson KH (2020), "A Primal-Dual SGD Algorithm for Distributed Nonconvex Optimization", June, 2020.
Abstract: The distributed nonconvex optimization problem of minimizing a global cost function formed by a sum of n local cost functions by using local information exchange is considered. This problem is an important component of many machine learning techniques with data parallelism, such as deep learning and federated learning. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm, suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions. We show that the proposed algorithm achieves the linear speedup convergence rate O(1/√(nT)) for general nonconvex cost functions and the well-known O(1/T) convergence rate when the global cost function satisfies the Polyak-Łojasiewicz condition, where T is the total number of iterations. We also show that the output of the proposed algorithm with fixed parameters linearly converges to a neighborhood of a global optimum. We demonstrate through numerical experiments the efficiency of our algorithm in comparison with the baseline centralized SGD and recently proposed distributed SGD algorithms.
BibTeX:
@article{Yi2020,
  author = {Xinlei Yi and Shengjun Zhang and Tao Yang and Tianyou Chai and Karl H. Johansson},
  title = {A Primal-Dual SGD Algorithm for Distributed Nonconvex Optimization},
  year = {2020}
}
You Y, Wang Y, Zhang H, Zhang Z, Demmel J and Hsieh C-J (2020), "The Limit of the Batch Size", June, 2020.
Abstract: Large-batch training is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet-50 training from 29 hours to around 1 minute. In this paper, we focus on studying the limit of the batch size. We think it may provide guidance to AI supercomputer and algorithm designers. We provide detailed numerical optimization instructions for step-by-step comparison. Moreover, it is important to understand the generalization and optimization performance of huge-batch training. Hoffer et al. introduced "ultra-slow diffusion" theory to large-batch training. However, our experiments show contradictory results with the conclusion of Hoffer et al. We provide comprehensive experimental results and detailed analysis to study the limitations of batch size scaling and "ultra-slow diffusion" theory. For the first time we scale the batch size on ImageNet to at least a magnitude larger than all previous work, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that is able to improve the top-1 test accuracy by 18% compared to the baseline.
BibTeX:
@article{You2020,
  author = {Yang You and Yuhui Wang and Huan Zhang and Zhao Zhang and James Demmel and Cho-Jui Hsieh},
  title = {The Limit of the Batch Size},
  year = {2020}
}
You Y, He Y, Rajbhandari S, Wang W, Hsieh C-J, Keutzer K and Demmel J (2020), "Fast LSTM by dynamic decomposition on cloud and distributed systems", Knowledge and Information Systems., 7, 2020. Springer Science and Business Media LLC.
Abstract: Long short-term memory (LSTM) is a powerful deep learning technique that has been widely used in many real-world data-mining applications such as language modeling and machine translation. In this paper, we aim to minimize the latency of LSTM inference on cloud systems without losing accuracy. If an LSTM model does not fit in cache, the latency due to data movement will likely be greater than that due to computation. In this case, we reduce model parameters. If, as in most applications we consider, the LSTM models are able to fit the cache of cloud server processors, we focus on reducing the number of floating point operations, which has a corresponding linear impact on the latency of the inference calculation. Thus, in our system, we dynamically reduce model parameters or flops depending on which most impacts latency. Our inference system is based on singular value decomposition and canonical polyadic decomposition. Our system is accurate and low latency. We evaluate our system based on models from a series of real-world applications like language modeling, computer vision, question answering, and sentiment analysis. Users of our system can use either pre-trained models or start from scratch. Our system achieves 15× average speedup for six real-world applications without losing accuracy in inference. We also design and implement a distributed optimization system with dynamic decomposition, which can significantly reduce the energy cost and accelerate the training process.
BibTeX:
@article{You2020a,
  author = {Yang You and Yuxiong He and Samyam Rajbhandari and Wenhan Wang and Cho-Jui Hsieh and Kurt Keutzer and James Demmel},
  title = {Fast LSTM by dynamic decomposition on cloud and distributed systems},
  journal = {Knowledge and Information Systems},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s10115-020-01487-8}
}
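The SVD branch of this idea is easy to illustrate: replacing a dense weight matrix by a rank-k factorization reduces a matrix-vector product from m*n to k*(m+n) multiply-adds. The MATLAB sketch below shows the trade-off on a random matrix (real LSTM weight matrices compress far better); the dynamic rank selection and the CP-decomposition part of the system are not reproduced.
% Truncated SVD of a weight matrix to cut the cost of W*x.
m = 1024; n = 1024; k = 64; rng(7);
W = randn(m, n);
x = randn(n, 1);
[U, S, V] = svd(W, 'econ');
Uk = U(:, 1:k) * S(1:k, 1:k);  Vk = V(:, 1:k);
yFull = W * x;                 % m*n multiply-adds
yLow  = Uk * (Vk' * x);        % k*(m+n) multiply-adds
relErr = norm(yFull - yLow) / norm(yFull)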
You X, Yang H, Luan Z, Qian D and Liu X (2020), "Zerospy: Exploring Software Inefficiency with Redundant Zeros", In Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis. Los Alamitos, CA, USA, 11, 2020. , pp. 397-410. IEEE Computer Society.
Abstract: Redundant zeros cause inefficiencies in which the zero values are loaded and computed repeatedly, resulting in unnecessary memory traffic and identity computation that waste memory bandwidth and CPU resources. Optimizing compilers have difficulty eliminating these zero-related inefficiencies due to limitations of static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not widely adopted in commodity processors. In this paper, we propose ZeroSpy - a fine-grained profiler to identify redundant zeros caused by both inappropriate use of data structures and useless computation. ZeroSpy also provides intuitive optimization guidance by revealing the locations where the redundant zeros happen in source lines and calling contexts. The experimental results demonstrate ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. Based on the optimization guidance revealed by ZeroSpy, we can achieve significant speedups after eliminating redundant zeros.
BibTeX:
@inproceedings{You2020b,
  author = {X. You and H. Yang and Z. Luan and D. Qian and X. Liu},
  title = {Zerospy: Exploring Software Inefficiency with Redundant Zeros},
  booktitle = {Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {IEEE Computer Society},
  year = {2020},
  pages = {397-410},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00033},
  doi = {10.1109/SC41405.2020.00033}
}
Yu T and Zhu H (2020), "Hyper-Parameter Optimization: A Review of Algorithms and Applications", March, 2020.
Abstract: Since deep neural networks were developed, they have made huge contributions to everyday lives. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks are still challenging and unpredictable procedures. To lower the technical thresholds for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academic and industrial areas. This paper provides a review of the most essential topics on HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods to define the value range. Then, the research focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy especially for deep learning networks. This study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art searching algorithms, feasibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with problems that exist when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.
BibTeX:
@article{Yu2020,
  author = {Tong Yu and Hong Zhu},
  title = {Hyper-Parameter Optimization: A Review of Algorithms and Applications},
  year = {2020}
}
Yuan R, Lazaric A and Gower RM (2020), "Sketched Newton-Raphson", June, 2020.
Abstract: We propose a new globally convergent stochastic second order method. Our starting point is the development of a new Sketched Newton-Raphson (SNR) method for solving large scale nonlinear equations of the form F(x)=0 with F: ℝ^d → ℝ^d. We then show how to design several stochastic second order optimization methods by re-writing the optimization problem of interest as a system of nonlinear equations and applying SNR. For instance, by applying SNR to find a stationary point of a generalized linear model (GLM), we derive completely new and scalable stochastic second order methods. We show that the resulting method is very competitive as compared to state-of-the-art variance reduced methods. Using a variable splitting trick, we also show that the Stochastic Newton method (SNM) is a special case of SNR, and use this connection to establish the first global convergence theory of SNM. Indeed, by showing that SNR can be interpreted as a variant of the stochastic gradient descent (SGD) method we are able to leverage proof techniques of SGD and establish a global convergence theory and rates of convergence for SNR. As a special case, our theory also provides a new global convergence theory for the original Newton-Raphson method under strictly weaker assumptions as compared to what is commonly used for global convergence. There are many ways to re-write an optimization problem as nonlinear equations. Each re-write would lead to a distinct method when using SNR. As such, we believe that SNR and its global convergence theory will open the way to designing and analysing a host of new stochastic second order methods.
BibTeX:
@article{Yuan2020,
  author = {Rui Yuan and Alessandro Lazaric and Robert M. Gower},
  title = {Sketched Newton-Raphson},
  year = {2020}
}
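As a rough illustration of the idea (not the authors' code), here is a minimal MATLAB sketch of a single sketched Newton step for F(x) = 0: draw a Gaussian sketching matrix S with tau columns and take the least-norm update that satisfies the sketched linearisation S'*(F(x) + J*dx) = 0. The helper name, the Gaussian sketch, and the use of pinv for the small system are all illustrative assumptions.
function xNew = iSketchedNewtonStep(F, J, x, tau)
% F: handle returning F(x) in R^d; J: handle returning the d-by-d Jacobian DF(x).
% tau: sketch size, much smaller than d; a fresh Gaussian sketch is drawn at each step.
d = numel(x);
S = randn(d, tau);
Jx = J(x);
G = S' * (Jx * Jx') * S;                          % small tau-by-tau system
xNew = x - Jx' * (S * (pinv(G) * (S' * F(x))));   % least-norm sketched Newton update
end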
Zachariadis O, Satpute N, Gómez-Luna J and Olivares J (2020), "Accelerating sparse matrix–matrix multiplication with GPU Tensor Cores", Computers & Electrical Engineering., 12, 2020. Vol. 88, pp. 106848. Elsevier BV.
Abstract: Sparse general matrix–matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM challenging. Modern GPUs include Tensor Core Units (TCUs), which specialize in dense matrix multiplication. Our aim is to re-purpose TCUs for sparse matrices. The key idea of our spGEMM algorithm, tSparse, is to multiply sparse rectangular blocks using the mixed precision mode of TCUs. tSparse partitions the input matrices into tiles and operates only on tiles which contain one or more elements. It creates a task list of the tiles, and performs matrix multiplication of these tiles using TCUs. To the best of our knowledge, this is the first time that TCUs are used in the context of spGEMM. We show that spGEMM, with our tiling approach, benefits from TCUs. Our approach significantly improves the performance of spGEMM in comparison to cuSPARSE, CUSP, RMerge2, Nsparse, AC-SpGEMM and spECK.
BibTeX:
@article{Zachariadis2020,
  author = {Orestis Zachariadis and Nitin Satpute and Juan Gómez-Luna and Joaquín Olivares},
  title = {Accelerating sparse matrix–matrix multiplication with GPU Tensor Cores},
  journal = {Computers & Electrical Engineering},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {88},
  pages = {106848},
  doi = {10.1016/j.compeleceng.2020.106848}
}
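The tiling idea in the abstract can be mimicked in a few lines of MATLAB: partition both operands into small square tiles and multiply only those tile pairs that both contain nonzeros. On the GPU these dense tile products are what tSparse hands to the Tensor Cores; here a plain dense multiply stands in, and the function name and the assumption that n is divisible by the tile size t are illustrative.
function C = iTiledSpgemmSketch(A, B, t)
% A, B: n-by-n sparse matrices, with n divisible by the tile size t.
n = size(A, 1);
nt = n / t;
C = zeros(n);
for bi = 1:nt
    for bj = 1:nt
        acc = zeros(t);
        for bk = 1:nt
            At = A((bi-1)*t+1:bi*t, (bk-1)*t+1:bk*t);
            Bt = B((bk-1)*t+1:bk*t, (bj-1)*t+1:bj*t);
            if nnz(At) > 0 && nnz(Bt) > 0          % skip tile pairs with no work
                acc = acc + full(At) * full(Bt);   % dense tile product (Tensor Core stand-in)
            end
        end
        C((bi-1)*t+1:bi*t, (bj-1)*t+1:bj*t) = acc;
    end
end
C = sparse(C);
end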
Zanon M, Zambonin G, Susto GA and McLoone S (2020), "Sparse Logistic Regression: Comparison of Regularization and Bayesian Implementations", Algorithms., 6, 2020. Vol. 13(6), pp. 137. MDPI AG.
Abstract: In knowledge-based systems, besides obtaining good output prediction accuracy, it is crucial to understand the subset of input variables that have most influence on the output, with the goal of gaining deeper insight into the underlying process. These requirements call for logistic model estimation techniques that provide a sparse solution, i.e., where coefficients associated with non-important variables are set to zero. In this work we compare the performance of two methods: the first one is based on the well known Least Absolute Shrinkage and Selection Operator (LASSO) which involves regularization with an l1 norm; the second one is the Relevance Vector Machine (RVM) which is based on a Bayesian implementation of the linear logistic model. The two methods are extensively compared in this paper, on real and simulated datasets. Results show that, in general, the two approaches are comparable in terms of prediction performance. RVM outperforms the LASSO both in terms of structure recovery (estimation of the correct non-zero model coefficients) and prediction accuracy when the dimensionality of the data tends to increase. However, LASSO shows comparable performance to RVM when the dimensionality of the data is much higher than the number of samples, that is, p ≫ n.
BibTeX:
@article{Zanon2020,
  author = {Mattia Zanon and Giuliano Zambonin and Gian Antonio Susto and Seán McLoone},
  title = {Sparse Logistic Regression: Comparison of Regularization and Bayesian Implementations},
  journal = {Algorithms},
  publisher = {MDPI AG},
  year = {2020},
  volume = {13},
  number = {6},
  pages = {137},
  doi = {10.3390/a13060137}
}
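For the LASSO side of the comparison, a minimal MATLAB sketch on synthetic data looks like this; it assumes the Statistics and Machine Learning Toolbox function lassoglm, the data and variable names are illustrative, and the RVM side has no comparable built-in.
rng('default');
n = 200; p = 50;
X = randn(n, p);
betaTrue = [3; -2; 1.5; zeros(p - 3, 1)];                % only three informative variables
y = double(rand(n, 1) < 1 ./ (1 + exp(-X * betaTrue)));  % Bernoulli responses from a logistic model
[B, fitInfo] = lassoglm(X, y, 'binomial', 'CV', 5);      % l1-regularised logistic regression
idxBest = fitInfo.IndexMinDeviance;                      % lambda chosen by cross-validation
recoveredSupport = find(B(:, idxBest) ~= 0)              % non-zero coefficients (structure recovery)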
Zeni A, Guidi G, Ellis M, Ding N, Santambrogio MD, Hofmeyr S, Buluç A, Oliker L and Yelick K (2020), "LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment", In Proceedings of the 34th IEEE International Parallel and Distributed Processing Symposium., February, 2020.
Abstract: Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, which we named LOGAN. Results show that our high-performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups up to 6.6× and 30.7× using 1 and 6 NVIDIA Tesla V100 GPUs, respectively, over the state-of-the-art software running on two IBM Power9 processors using 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3× LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6×. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near-optimal on the NVIDIA Tesla V100s.
BibTeX:
@inproceedings{Zeni2020,
  author = {Alberto Zeni and Giulia Guidi and Marquita Ellis and Nan Ding and Marco D. Santambrogio and Steven Hofmeyr and Aydın Buluç and Leonid Oliker and Katherine Yelick},
  title = {LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment},
  booktitle = {Proceedings of the 34th IEEE International Parallel and Distributed Processing Symposium},
  year = {2020}
}
Zhang J, Lin H, Sra S and Jadbabaie A (2020), "On Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions", February, 2020.
Abstract: We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains important examples such as ReLU neural networks and others with non-differentiable activation functions. First, we show that finding an 𝜖-stationary point with first-order methods is impossible in finite time. Therefore, we introduce the notion of (δ, 𝜖)-stationarity, a generalization that allows for a point to be within distance δ of an 𝜖-stationary point and reduces to 𝜖-stationarity for smooth functions. We propose a series of randomized first-order methods and analyze their complexity of finding a (δ, 𝜖)-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on δ. Empirically, our methods perform well for training ReLU neural networks.
BibTeX:
@article{Zhang2020,
  author = {Jingzhao Zhang and Hongzhou Lin and Suvrit Sra and Ali Jadbabaie},
  title = {On Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions},
  year = {2020}
}
Zhang Y, Sahinidis NV, Nohra C and Rong G (2020), "Optimality-based domain reduction for inequality-constrained NLP and MINLP problems", Journal of Global Optimization., 2, 2020. Springer Science and Business Media LLC.
Abstract: In spatial branch-and-bound algorithms, optimality-based domain reduction is normally performed after solving a node and relies on duality information to reduce ranges of variables. In this work, we propose novel optimality conditions for NLP and MINLP problems and apply them for domain reduction prior to solving a node in branch-and-bound. The conditions apply to nonconvex inequality-constrained problems for which we exploit monotonicity properties of objectives and constraints. We develop three separate reduction algorithms for unconstrained, one-constraint, and multi-constraint problems. We use the optimality conditions to reduce ranges of variables through forward and backward bound propagation of gradients with respect to each decision variable. We describe an efficient implementation of these techniques in the branch-and-bound solver BARON. The implementation dynamically recognizes and ignores inactive constraints at each node of the search tree. Our computations demonstrate that the proposed techniques often reduce the solution time and total number of nodes for continuous problems; they are less effective for mixed-integer programs.
BibTeX:
@article{Zhang2020a,
  author = {Yi Zhang and Nikolaos V. Sahinidis and Carlos Nohra and Gang Rong},
  title = {Optimality-based domain reduction for inequality-constrained NLP and MINLP problems},
  journal = {Journal of Global Optimization},
  publisher = {Springer Science and Business Media LLC},
  year = {2020},
  doi = {10.1007/s10898-020-00886-z}
}
Zhang Z, Wang H, Han S and Dally WJ (2020), "SpArch: Efficient Architecture for Sparse Matrix Multiplication", February, 2020.
Abstract: Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner product based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer product based approach suffers from poor output locality due to numerous partial product matrices. Inefficiency in the reuse of either input or output data leads to extensive and expensive DRAM access. To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stages of partial matrices so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4×. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces the DRAM access by another 1.8×. We also resolve the increased input matrix reads induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing the DRAM access by 1.5×. Evaluated on 20 benchmarks, SpArch reduces the total DRAM access by 2.8× over the previous state-of-the-art. On average, SpArch achieves 4×, 19×, 18×, 17×, 1285× speedup and 6×, 164×, 435×, 307×, 62× energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
BibTeX:
@article{Zhang2020b,
  author = {Zhekai Zhang and Hanrui Wang and Song Han and William J. Dally},
  title = {SpArch: Efficient Architecture for Sparse Matrix Multiplication},
  year = {2020}
}
Zhang Q, Huang F, Deng C and Huang H (2020), "Faster Stochastic Quasi-Newton Methods", April, 2020.
Abstract: Recently, stochastic optimization methods have become a class of powerful optimization tools in machine learning. Stochastic gradient descent (SGD) is one of the representative stochastic methods and is widely used for many machine learning problems. However, SGD only uses first-order information of the problems it optimizes, which results in limitations such as solutions of limited accuracy. Thus, stochastic quasi-Newton methods have recently attracted wide attention because they utilize approximate Hessian information, which makes them more robust and able to achieve better accuracy than stochastic first-order methods. Since existing stochastic quasi-Newton methods still do not reach the best known stochastic first-order oracle (SFO) complexity, we propose a novel faster stochastic quasi-Newton method (SpiderSQN) based on the variance reduction technique of SPIDER. Moreover, we prove that our SpiderSQN method reaches the best known SFO complexity of O(n + n^{1/2}𝜖^{-2}) in the finite-sum setting to obtain an 𝜖-first-order stationary point. To further improve its practical performance, we incorporate SpiderSQN with different effective momentum schemes. Moreover, the proposed algorithms are generalized to the online setting, and the corresponding SFO complexity of O(𝜖^{-3}) is developed, which matches the existing best result. Extensive experiments on benchmark datasets demonstrate that the proposed SpiderSQN-type algorithms outperform state-of-the-art algorithms for nonconvex optimization.
BibTeX:
@article{Zhang2020c,
  author = {Qingsong Zhang and Feihu Huang and Cheng Deng and Heng Huang},
  title = {Faster Stochastic Quasi-Newton Methods},
  year = {2020}
}
Zhang G, Allaire D and Cagan J (2020), "An Initial Guess Free Method for Least Squares Parameter Estimation in Nonlinear Models", Unpublished.
Abstract: Fitting models to data is critical in many science and engineering fields. A major task in fitting models to data is to estimate the value of each parameter in a given model. Iterative methods, such as the Gauss-Newton method and the Levenberg-Marquardt method, are often employed for parameter estimation in nonlinear models. However, practitioners must guess the initial value for each parameter in order to initialize these iterative methods. A poor initial guess can contribute to non-convergence of these methods or lead these methods to converge to a wrong solution. In this paper, an initial guess free method is introduced to find the optimal parameter estimators in a nonlinear model that minimizes the squared error of the fit. The method includes three algorithms that require different levels of computational power to find the optimal parameter estimators. The method constructs a solution interval for each parameter in the model. These solution intervals significantly reduce the search space for optimal parameter estimators. The method also provides an empirical probability distribution for each parameter, which is valuable for parameter uncertainty assessment. The initial guess free method is validated through a case study in which Fick's second law is fit to an experimental data set. This case study shows that the initial guess free method can find the optimal parameter estimators efficiently. A four-step procedure for implementing the initial guess free method in practice is also outlined.
BibTeX:
@article{Zhang2020d,
  author = {Zhang, Guanglu and Allaire, Douglas and Cagan, Jonathan},
  title = {An Initial Guess Free Method for Least Squares Parameter Estimation in Nonlinear Models},
  publisher = {Unpublished},
  year = {2020},
  doi = {10.13140/RG.2.2.32573.82402}
}
Zhang Y, Azad A and Buluç A (2020), "Parallel algorithms for finding connected components using linear algebra", Journal of Parallel and Distributed Computing., 5, 2020. Elsevier BV.
Abstract: Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various different models of parallel computation. This paper presents a class of parallel connected-component algorithms designed using linear-algebraic primitives. These algorithms are based on a PRAM algorithm by Shiloach and Vishkin and can be designed using standard GraphBLAS operations. We demonstrate two algorithms of this class, one named LACC for Linear Algebraic Connected Components, and the other named FastSV which can be regarded as LACC's simplification. With the support of the highly-scalable Combinatorial BLAS library, LACC and FastSV outperform the previous state-of-the-art algorithm by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC and FastSV scale to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperform previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.
BibTeX:
@article{Zhang2020e,
  author = {Yongzhe Zhang and Ariful Azad and Aydın Buluç},
  title = {Parallel algorithms for finding connected components using linear algebra},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier BV},
  year = {2020},
  doi = {10.1016/j.jpdc.2020.04.009}
}
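In the same linear-algebraic spirit, though far simpler than LACC or FastSV, connected components can be computed by repeatedly replacing each vertex label with the minimum over its own label and its neighbours' labels until nothing changes; the min-reduction plays the role of a min-semiring matrix-vector product in GraphBLAS terms. The helper name is illustrative.
function labels = iConnectedComponentsSketch(A)
% A: sparse symmetric adjacency matrix of an undirected graph (n-by-n).
n = size(A, 1);
labels = (1:n)';                       % every vertex starts as its own component
[src, dst] = find(A);                  % edge list (both directions, since A is symmetric)
while true
    % Smallest label among the neighbours of each vertex (min-semiring "A*labels").
    neighbourMin = accumarray(dst, labels(src), [n 1], @min, inf);
    newLabels = min(labels, neighbourMin);
    if isequal(newLabels, labels)
        break;                         % fixed point: labels now identify the components
    end
    labels = newLabels;
end
end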
Zhang Y, Zhao Z and Feng Z (2020), "SF-GRASS: Solver-Free Graph Spectral Sparsification", August, 2020.
Abstract: Recent spectral graph sparsification techniques have shown promising performance in accelerating many numerical and graph algorithms, such as iterative methods for solving large sparse matrices, spectral partitioning of undirected graphs, vectorless verification of power/thermal grids, representation learning of large graphs, etc. However, prior spectral graph sparsification methods rely on fast Laplacian matrix solvers that are usually challenging to implement in practice. This work, for the first time, introduces a solver-free approach (SF-GRASS) for spectral graph sparsification by leveraging emerging spectral graph coarsening and graph signal processing (GSP) techniques. We introduce a local spectral embedding scheme for efficiently identifying spectrally-critical edges that are key to preserving graph spectral properties, such as the first few Laplacian eigenvalues and eigenvectors. Since the key kernel functions in SF-GRASS can be efficiently implemented using sparse-matrix-vector-multiplications (SpMVs), the proposed spectral approach is simple to implement and inherently parallel friendly. Our extensive experimental results show that the proposed method can produce a hierarchy of high-quality spectral sparsifiers in nearly-linear time for a variety of real-world, large-scale graphs and circuit networks when compared with the prior state-of-the-art spectral method.
BibTeX:
@article{Zhang2020f,
  author = {Ying Zhang and Zhiqiang Zhao and Zhuo Feng},
  title = {SF-GRASS: Solver-Free Graph Spectral Sparsification},
  year = {2020}
}
Zhang C (2020), "A New Perspective of Graph Data and A Generic and Efficient Method for Large Scale Graph Data Traversal", September, 2020.
Abstract: The BFS algorithm is a basic graph data processing algorithm, and many other graph processing algorithms share similar architectural features with BFS and can be built on the basis of the BFS algorithm model. We analyze the differences between graph algorithms and traditional high-performance algorithms in detail, propose a new way of classifying algorithms into data-independent and data-correlated algorithms based on their run-time correlation with data, and use this new classification to explain the validity of the methods proposed in this paper. Through a deeper analysis of graph data, we propose a new fundamental perspective on understanding graph data, establishing a link between two basic data structures, graphs and trees, and viewing graph data as consisting of smaller subgraphs and edge trees. Small-degree vertices are found to be one of the important causes of random memory access. Based on this, we propose a general, easy to implement, and efficient method for graph data processing, whose basic idea is to treat low-degree vertices and core subgraphs separately, thus significantly reducing the amount of random memory access and improving the efficiency of memory access. Finally, we evaluated the performance of the method on three major data center computing platforms (Intel, AMD, and ARM); the experiments showed that it brought 19.7%, 31.8% and 17.9% performance improvements, respectively, with a performance-power ratio of 282.70 MTEPS/W on the ARM platform, ranking world No. 1 on the big data set list of the Green Graph500 in November 2019.
BibTeX:
@article{Zhang2020g,
  author = {Chenglong Zhang},
  title = {A New Perspective of Graph Data and A Generic and Efficient Method for Large Scale Graph Data Traversal},
  year = {2020}
}
Zhang G, Allaire D and Cagan J (2020), "Taking the Guess Work Out of the Initial Guess: A Solution Interval Method for Least Squares Parameter Estimation in Nonlinear Models", Journal of Computing and Information Science in Engineering., 10, 2020. , pp. 1-61. ASME International.
Abstract: Fitting a specified model to data is critical in many science and engineering fields. A major task in fitting a specified model to data is to estimate the value of each parameter in the model. Iterative local methods, such as the Gauss-Newton method and the Levenberg-Marquardt method, are often employed for parameter estimation in nonlinear models. However, practitioners must guess the initial value for each parameter to initialize these iterative local methods. A poor initial guess can contribute to non-convergence of these methods or lead these methods to converge to a wrong or inferior solution. In this paper, a solution interval method is introduced to find the optimal estimator for each parameter in a nonlinear model that minimizes the squared error of the fit. The method includes three algorithms that require different levels of computational power to find the optimal parameter estimators. The method constructs a solution interval for each parameter in the model. These solution intervals significantly reduce the search space for optimal parameter estimators. The method also provides an empirical probability distribution for each parameter, which is valuable for parameter uncertainty assessment. The solution interval method is validated through two case studies in which the Michaelis-Menten model and Fick's second law are fit to experimental data sets, respectively. These case studies show that the solution interval method can find optimal parameter estimators efficiently. A four-step procedure for implementing the solution interval method in practice is also outlined.
BibTeX:
@article{Zhang2020h,
  author = {Guanglu Zhang and Douglas Allaire and Jonathan Cagan},
  title = {Taking the Guess Work Out of the Initial Guess: A Solution Interval Method for Least Squares Parameter Estimation in Nonlinear Models},
  journal = {Journal of Computing and Information Science in Engineering},
  publisher = {ASME International},
  year = {2020},
  pages = {1--61},
  doi = {10.1115/1.4048811}
}
Zhao H, Xia T, Li C, Zhao W, Zheng N and Ren P (2020), "Exploring Better Speculation and Data Locality in Sparse Matrix-Vector Multiplication on Intel Xeon", In Proceedings of the 38th International Conference on Computer Design., October, 2020. IEEE.
Abstract: Sparse Matrix-Vector Multiplication (SpMV) is a fundamental workload of numerous applications. However, for today's high-end superscalar CPUs, such as the Intel Xeon series, it is usually difficult to perform SpMV efficiently due to the irregular, matrix-dependent data access and computation pattern. While much research focuses on the memory bandwidth bound by improving data locality, this work dives into the execution of SpMV computation on Intel Xeon CPUs and reveals that the bad-speculation penalty is significant for many sparse matrices and too expensive to be ignored. We study and characterize sparsity structure types that are more vulnerable to the cache miss penalty or the bad speculation penalty, respectively. Based on this insight, we propose a fast preprocessing method, which divides the matrix into sub-matrices and determines the critical performance bound of each sub-matrix according to the data distribution characteristics. On each sub-matrix, a combination of dedicated row reordering strategies is performed to efficiently alleviate its key performance bounds: bad speculation, cache misses, or both. Our matrix representation is based on the standard Compressed Sparse Row (CSR) format and can be easily adapted to existing SpMV libraries. Our approach is evaluated on an Intel Xeon Gold 6146 processor with a wide range of matrices from the SuiteSparse benchmarks. The results demonstrate that the proposed approach achieves an average 1.8× speedup (up to 2.5×) over multi-threaded MKL sparse routines, with quite low pre-processing cost. Additionally, when used in conjunction with MKL's original optimization method, our approach further improves the speedup to an average of 3.6× (up to 8.3×). This result indicates that our method can serve as a fast and wide-spectrum optimization method that is compatible with existing routines.
BibTeX:
@inproceedings{Zhao2020,
  author = {Haoran Zhao and Tian Xia and Chenyang Li and Wenzhe Zhao and Nanning Zheng and Pengju Ren},
  title = {Exploring Better Speculation and Data Locality in Sparse Matrix-Vector Multiplication on Intel Xeon},
  booktitle = {Proceedings of the 38th International Conference on Computer Design},
  publisher = {IEEE},
  year = {2020},
  doi = {10.1109/iccd50377.2020.00105}
}
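To make the access pattern discussed in the abstract concrete, here is the CSR sparse matrix-vector product written as explicit loops in MATLAB (purely illustrative; MATLAB's own A*x works on its internal compressed-column storage). The indirect load x(colIdx(k)) is what makes SpMV irregular and matrix-dependent.
function y = iCsrSpmvSketch(rowPtr, colIdx, vals, x)
% rowPtr: (n+1)-vector of row start offsets; colIdx, vals: column indices and values in row order.
n = numel(rowPtr) - 1;
y = zeros(n, 1);
for i = 1:n
    acc = 0;
    for k = rowPtr(i):(rowPtr(i + 1) - 1)    % nonzeros of row i
        acc = acc + vals(k) * x(colIdx(k));  % indirect, data-dependent access into x
    end
    y(i) = acc;
end
end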
Zheng Q, Xi Y and Saad Y (2020), "A power Schur complement Low-Rank correction preconditioner for general sparse linear systems", February, 2020.
Abstract: An effective power based parallel preconditioner is proposed for general large sparse linear systems. The preconditioner combines a power series expansion method with some low-rank correction techniques, where the Sherman-Morrison-Woodbury formula is utilized. A matrix splitting of the Schur complement is proposed to expand the power series. The number of terms used in the power series expansion can control the approximation accuracy of the preconditioner to the inverse of the Schur complement. To construct the preconditioner, graph partitioning is invoked to reorder the original coefficient matrix, leading to a special block two-by-two matrix whose two off-diagonal submatrices are block diagonal. The interface variables are obtained by solving a linear system whose coefficient matrix is the Schur complement. For the interior variables, one only needs to solve a block diagonal linear system, which can be performed efficiently in parallel. Various numerical examples are provided to illustrate the efficiency of the proposed preconditioner.
BibTeX:
@article{Zheng2020,
  author = {Qingqing Zheng and Yuanzhe Xi and Yousef Saad},
  title = {A power Schur complement Low-Rank correction preconditioner for general sparse linear systems},
  year = {2020}
}
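A stripped-down version of the mechanism, with a simple Jacobi splitting standing in for the paper's Schur-complement splitting, applies a truncated power (Neumann) series approximation of inv(A) as a preconditioner; it is only sensible when the splitting is convergent, and the helper name and splitting choice are illustrative.
function x = iPowerSeriesPrecondSketch(A, b, numTerms)
% Approximates inv(A)*b via the splitting A = M - N:
% inv(A) is approximated by (I + P + ... + P^(numTerms-1)) * inv(M), with P = inv(M)*N.
n = size(A, 1);
M = spdiags(diag(A), 0, n, n);   % Jacobi splitting (illustrative choice)
N = M - A;
term = M \ b;
x = term;
for k = 2:numTerms
    term = M \ (N * term);       % next term of the series
    x = x + term;                % more terms give a closer approximation to inv(A)*b
end
end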
Zhou K, Krentel M and Mellor-Crummey J (2020), "A tool for top-down performance analysis of GPU-accelerated applications", In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., 2, 2020. ACM.
Abstract: To support performance measurement and analysis of GPU-accelerated applications, we extended the HPCToolkit performance tools with several novel features. To support efficient monitoring of accelerated applications, HPCToolkit employs a new wait-free data structure to coordinate measurement and attribution between each application thread and a GPU monitor thread. To help developers understand the performance of accelerated applications, HPCToolkit attributes metrics to heterogeneous calling contexts that span both CPUs and GPUs. To support fine-grain analysis and tuning of GPU-accelerated code, HPCToolkit collects PC samples of both CPU and GPU activity to derive and attribute metrics at all levels in a heterogeneous calling context.
BibTeX:
@inproceedings{Zhou2020,
  author = {Keren Zhou and Mark Krentel and John Mellor-Crummey},
  title = {A tool for top-down performance analysis of GPU-accelerated applications},
  booktitle = {Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  year = {2020},
  doi = {10.1145/3332466.3374534}
}
Zhu Q and Zaidman A (2020), "Massively Parallel, Highly Efficient, but What About the Test Suite Quality? Applying Mutation Testing to GPU Programs", In Proceedings of the International Conference on Software Testing, Verification, and Validation.
Abstract: Thanks to rapid advances in programmability and performance, GPUs have been widely applied in High-Performance Computing (HPC) and safety-critical domains. As such, quality assurance of GPU applications has gained increasing attention. This brings us to mutation testing, a fault-based testing technique that assesses the test suite quality by systematically introducing small artificial faults. It has been shown to perform well in exposing faults. In this paper, we investigate whether GPU programming can benefit from mutation testing. In addition to conventional mutation operators, we propose nine GPU-specific mutation operators based on the core syntax differences between CPU and GPU programming. We conduct a preliminary study on six CUDA systems. The results show that mutation testing can effectively evaluate the test quality of GPU programs: conventional mutation operators can guide the engineers to write simple direct tests, while GPU-specific mutation operators can lead to more intricate test cases which are better at revealing GPU-specific weaknesses.
BibTeX:
@inproceedings{Zhu2020,
  author = {Qianqian Zhu and Andy Zaidman},
  title = {Massively Parallel, Highly Efficient, but What About the Test Suite Quality? Applying Mutation Testing to GPU Programs},
  booktitle = {Proceedings of the International Conference on Software Testing, Verification, and Validation},
  year = {2020}
}
Zhu Y, Liu Y and Zhang G (2020), "FT-PBLAS: PBLAS-based Fault-tolerant Linear Algebra Computation on High-performance Computing Systems", IEEE Access. , pp. 1-1.
Abstract: As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.
BibTeX:
@article{Zhu2020a,
  author = {Y. Zhu and Y. Liu and G. Zhang},
  title = {FT-PBLAS: PBLAS-based Fault-tolerant Linear Algebra Computation on High-performance Computing Systems},
  journal = {IEEE Access},
  year = {2020},
  pages = {1-1},
  doi = {10.1109/ACCESS.2020.2975832}
}
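The checksum mechanism behind ABFT-style libraries is easy to illustrate in MATLAB. This is the classic row/column checksum scheme for matrix multiplication, not FT-PBLAS's block-checksum implementation: append a column-sum row to A and a row-sum column to B, multiply, and a single corrupted entry of C is located at the intersection of the flagged row and column.
n = 4;
A = rand(n); B = rand(n);
Ac = [A; ones(1, n) * A];                % extra row holding the column sums of A
Br = [B, B * ones(n, 1)];                % extra column holding the row sums of B
Cf = Ac * Br;                            % the product carries both checksums
C = Cf(1:n, 1:n);
C(2, 3) = C(2, 3) + 1;                   % inject a fault into one entry
badCols = find(abs(sum(C, 1) - Cf(end, 1:n)) > 1e-10)   % flags column 3
badRows = find(abs(sum(C, 2) - Cf(1:n, end)) > 1e-10)   % flags row 2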
Zou Q and Magoulès F (2020), "Reducing the effect of global synchronization in delayed gradient methods for symmetric linear systems", Advances in Engineering Software., 9, 2020. Vol. 147, pp. 102837. Elsevier BV.
Abstract: Compared with arithmetic operations, communication cost is often the bottleneck on modern computers, and thus should receive increasing attention when choosing algorithms. Lagged gradient methods are known for their error tolerance and fast convergence. However, it appears that their parallel behavior is not well understood. In this paper, we explore the cyclic formulations of lagged gradient methods and s-dimensional methods for reducing global synchronizations. We provide parallel implementations for these methods and propose some new variants. A comparison is then reported for different gradient iterative schemes. To illustrate the performance, we run a number of experiments, from which we conclude that our formulations perform better than traditional methods in view of both iteration count and computing time.
BibTeX:
@article{Zou2020,
  author = {Qinmeng Zou and Frédéric Magoulès},
  title = {Reducing the effect of global synchronization in delayed gradient methods for symmetric linear systems},
  journal = {Advances in Engineering Software},
  publisher = {Elsevier BV},
  year = {2020},
  volume = {147},
  pages = {102837},
  doi = {10.1016/j.advengsoft.2020.102837}
}
Zounon M, Higham NJ, Lucas C and Tisseur F (2020), "Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems"
Abstract: It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems, and existing work focuses on single core experiments. We evaluate the benefits of using single precision arithmetic in solving double precision sparse linear systems using multiple cores, focusing on the key components of LU factorization and matrix–vector products. We find that single precision sparse LU factorization is prone to a severe loss of performance due to the intrusion of subnormal numbers. We identify a mechanism that allows cascading fill-ins to generate subnormal numbers and show that automatically flushing subnormals to zero avoids the performance penalties. Our results show that the anticipated speedup of 2× over a double precision LU factorization is obtained only for the very largest of our test problems. For iterative solvers, we find that for the majority of the matrices computing or applying incomplete factorization preconditioners in single precision does not present sufficient performance benefits to justify the loss of accuracy compared with the use of double precision. We also find that using single precision for the matrix–vector product kernels provides an average speedup of 1.5× over double precision kernels, but new mixed precision algorithms are needed to exploit this benefit without losing the performance gain in the process of refining the solution to double precision accuracy.
BibTeX:
@article{Zounon2020,
  author = {Zounon, Mawussi and Higham, Nicholas J. and Lucas, Craig and Tisseur, Françoise},
  title = {Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems},
  year = {2020},
  url = {http://eprints.maths.manchester.ac.uk/2783/1/paper.pdf}
}
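The basic pattern evaluated here, factorize in low precision and refine the solution back to double precision accuracy, fits in a few lines of MATLAB; dense factors are used below because MATLAB has no sparse single type, and the test matrix and iteration count are illustrative.
A = gallery('poisson', 20);                 % sparse SPD test matrix (400-by-400)
b = ones(size(A, 1), 1);
[L, U, P] = lu(single(full(A)));            % LU factorization in single precision
x = double(U \ (L \ (P * single(b))));      % initial solve with the low-precision factors
for k = 1:5
    r = b - A * x;                          % residual computed in double precision
    d = double(U \ (L \ (P * single(r))));  % correction from the single-precision factors
    x = x + d;                              % refined solution
end
relativeResidual = norm(b - A * x) / norm(b)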