GREY BALLARD

Postdoc Opening: We are currently seeking candidates for a postdoctoral research fellowship beginning in July 2023. The fellow will work primarily with Grey Ballard and Aditya Devarakonda to design, implement, and analyze high-performance algorithms for sparse matrix and tensor computations as part of the Sparsitute. We are particularly interested in scholars who have a strong background in numerical linear algebra, numerical analysis, and/or parallel and high-performance computing. Apply Here

Full List of Publications


Abstract:
The tensor-train (TT) format is a highly compact low-rank representation for high-dimensional tensors. TT is particularly useful when representing approximations to the solutions of certain types of parametrized partial differential equations. For many of these problems, computing the solution explicitly would require an infeasible amount of memory and computational time. While the TT format makes these problems tractable, iterative techniques for solving the PDEs must be adapted to perform arithmetic while maintaining the implicit structure. The fundamental operation used to maintain feasible memory and computational time is called rounding, which truncates the internal ranks of a tensor already in TT format. We propose several randomized algorithms for this task that are generalizations of randomized low-rank matrix approximation algorithms and provide significant reduction in computation compared to deterministic TT-rounding algorithms. Randomization is particularly effective in the case of rounding a sum of TT-tensors (where we observe 20x speedup), which is the bottleneck computation in the adaptation of GMRES to vectors in TT format. We present the randomized algorithms and compare their empirical accuracy and computational time with deterministic alternatives.
BibTeX:
@article{AB+23,
  author = {Al Daas, Hussam and Ballard, Grey and Cazeaux, Paul and Hallman, Eric and Mi\k{e}dlar, Agnieszka and Pasha, Mirjeta and Reid, Tim W. and Saibaba, Arvind K.},
  title = {Randomized Algorithms for Rounding in the Tensor-Train Format},
  journal = {SIAM Journal on Scientific Computing},
  year = {2023},
  volume = {45},
  number = {1},
  pages = {A74-A95},
  url = {https://doi.org/10.1137/21M1451191}
}
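
The randomized TT-rounding algorithms described in the abstract above generalize randomized low-rank matrix approximation. As a point of reference only, here is a minimal NumPy sketch of the matrix-level idea (a randomized range finder followed by a small SVD); the matrix sizes, oversampling parameter, and function name are illustrative, and this is not the paper's TT algorithm or implementation.
Code sketch (Python):
import numpy as np

def randomized_low_rank(A, r, oversample=10, seed=0):
    """Randomized rank-r approximation of a matrix A (matrix-level sketch only)."""
    rng = np.random.default_rng(seed)
    # Sketch the range of A with a Gaussian test matrix, then orthonormalize.
    Y = A @ rng.standard_normal((A.shape[1], r + oversample))
    Q, _ = np.linalg.qr(Y)
    # Project onto the sampled range and truncate to rank r via a small SVD.
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U[:, :r], s[:r], Vt[:r]

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 400))   # numerically rank 20
U, s, Vt = randomized_low_rank(A, r=20)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))           # ~ machine precision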

Abstract:
In this paper, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memory-independent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communication-optimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.
BibTeX:
@inproceedings{ABGKR23,
  author = {Al Daas, Hussam and Grey Ballard and Laura Grigori and Suraj Kumar and Kathryn Rouse},
  title = {Parallel Memory-Independent Communication Bounds for {SYRK}},
  booktitle = {Proceedings of the 35th {ACM} Symposium on Parallelism in Algorithms and Architectures},
  publisher = {{ACM}},
  month = {June},
  year = {2023},
  url = {https://doi.org/10.1145/3558481.3591072}
}
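
The savings analyzed in this paper come from the symmetry of the output C = A A^T: only half of the output blocks need to be computed. The serial NumPy sketch below illustrates that triangular blocking idea at a high level; the block size is arbitrary, and the paper's parallel distribution and communication schedule are not represented.
Code sketch (Python):
import numpy as np

def syrk_lower(A, b=64):
    """Compute the lower triangle of C = A @ A.T block by block (serial sketch)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, i + 1, b):
            # Only blocks on or below the block diagonal are formed.
            C[i:i+b, j:j+b] = A[i:i+b] @ A[j:j+b].T
    return C  # upper triangle is left as zeros; symmetry gives C[j, i] = C[i, j]

A = np.random.default_rng(1).standard_normal((300, 150))
C = syrk_lower(A)
assert np.allclose(np.tril(C), np.tril(A @ A.T))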

Abstract:
Joint Nonnegative Matrix Factorization (JointNMF) is a hybrid method for mining information from datasets that contain both feature and connection information. We propose distributed-memory parallelizations of three algorithms for solving the JointNMF problem based on Alternating Nonnegative Least Squares, Projected Gradient Descent, and Projected Gauss-Newton. We extend well-known communication-avoiding algorithms from the single processor grid case to our coupled case on two processor grids. We demonstrate the scalability of the algorithms on up to 960 cores (40 nodes) with 60% parallel efficiency. The more sophisticated Alternating Nonnegative Least Squares (ANLS) and Gauss-Newton variants outperform the first-order gradient descent method in reducing the objective on large-scale problems. We perform a topic modelling task on a large corpus of academic papers that consists of over 37 million paper abstracts and nearly a billion citation relationships, demonstrating the utility and scalability of the methods.
BibTeX:
@inproceedings{EC+23,
  author = {Eswar, Srinivas and Cobb, Benjamin and Hayashi, Koby and Kannan, Ramakrishnan and Ballard, Grey and Vuduc, Richard and Park, Haesun},
  title = {Distributed-Memory Parallel {JointNMF}},
  booktitle = {Proceedings of the 37th International Conference on Supercomputing},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  year = {2023},
  series = {ICS '23},
  pages = {301-312},
  url = {https://doi.org/10.1145/3577193.3593733}
}
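
One of the three solver families named in the abstract above is projected gradient descent. As a shape-level illustration only, the sketch below applies alternating projected gradient steps to plain NMF (min ||A - WH||_F with W, H >= 0), not to the coupled JointNMF objective, and it is serial rather than distributed; the step-size rule is a standard 1/L choice, not necessarily the paper's.
Code sketch (Python):
import numpy as np

def nmf_pgd(A, r, iters=300, seed=0):
    """Alternating projected gradient descent for min ||A - W H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W, H = rng.random((m, r)), rng.random((r, n))
    for _ in range(iters):
        # W-step: the half-gradient is (W H - A) H^T; step 1/||H H^T||_2 guarantees descent.
        W = np.maximum(W - ((W @ H - A) @ H.T) / np.linalg.norm(H @ H.T, 2), 0.0)
        # H-step with the updated W.
        H = np.maximum(H - (W.T @ (W @ H - A)) / np.linalg.norm(W.T @ W, 2), 0.0)
    return W, H

A = np.abs(np.random.default_rng(1).standard_normal((200, 150)))
W, H = nmf_pgd(A, 10)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))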

Abstract:
The CP tensor decomposition is used in applications such as machine learning and signal processing to discover latent low-rank structure in multidimensional data. Computing a CP decomposition via an alternating least squares (ALS) method reduces the problem to several linear least squares problems. The standard way to solve these linear least squares subproblems is to use the normal equations, which inherit special tensor structure that can be exploited for computational efficiency. However, the normal equations are sensitive to numerical ill-conditioning, which can compromise the results of the decomposition. In this paper, we develop versions of the CP-ALS algorithm using the QR decomposition and the singular value decomposition, which are more numerically stable than the normal equations, to solve the linear least squares problems. Our algorithms utilize the tensor structure of the CP-ALS subproblems efficiently, have the same complexity as the standard CP-ALS algorithm when the input is dense and the rank is small, and are shown via examples to produce more stable results when ill-conditioning is present. Our MATLAB implementation achieves the same running time as the standard algorithm for small ranks, and we show that the new methods can obtain lower approximation error.
BibTeX:
@article{MVLB23,
  author = {Minster, Rachel and Viviano, Irina and Liu, Xiaotian and Ballard, Grey},
  title = {{CP} Decomposition for Tensors via Alternating Least Squares with {QR} Decomposition},
  journal = {Numerical Linear Algebra with Applications},
  year = {2023},
  pages = {e2511},
  url = {https://doi.org/10.1002/nla.2511}
}
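
The stability issue addressed here is the classical one for linear least squares: solving via the normal equations squares the condition number, while a QR-based solve works with the original matrix. The generic NumPy comparison below illustrates that effect on an ill-conditioned problem; it does not use the structure-exploiting CP-ALS formulation from the paper, and the sizes and conditioning are arbitrary.
Code sketch (Python):
import numpy as np

rng = np.random.default_rng(3)
# Build an ill-conditioned tall matrix K (cond ~ 1e7) and a consistent right-hand side.
U, _ = np.linalg.qr(rng.standard_normal((200, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
K = U @ np.diag(np.logspace(0, -7, 5)) @ V.T
x_true = rng.standard_normal(5)
b = K @ x_true

# Normal equations: the condition number is squared (~1e14), so accuracy suffers.
x_ne = np.linalg.solve(K.T @ K, K.T @ b)

# QR-based solve: works with K directly.
Q, R = np.linalg.qr(K)
x_qr = np.linalg.solve(R, Q.T @ b)

print(np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true))
print(np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true))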

Abstract:
We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms and inner products, orthonormalization, and rounding (rank truncation). These are the kernel operations for applications such as iterative Krylov solvers that exploit the TT structure. The parallel algorithms are designed for distributed-memory computation, and we propose a data distribution and strategy that parallelizes computations for individual cores within the TT format. We analyze the computation and communication costs of the proposed algorithms to show their scalability, and we present numerical experiments that demonstrate their efficiency on both shared-memory and distributed-memory parallel systems. For example, we observe better single-core performance than the existing MATLAB TT-Toolbox in rounding a 2GB TT tensor, and our implementation achieves a 34x speedup using all 40 cores of a single node. We also show nearly linear parallel scaling on larger TT tensors up to over 10,000 cores for all mathematical operations.
BibTeX:
@article{ABB22,
  author = {Al Daas, Hussam and Ballard, Grey and Benner, Peter},
  title = {Parallel Algorithms for Tensor Train Arithmetic},
  journal = {SIAM Journal on Scientific Computing},
  year = {2022},
  volume = {44},
  number = {1},
  pages = {C25-C53},
  url = {https://doi.org/10.1137/20M1387158}
}
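
For readers unfamiliar with the format: a d-way tensor in TT form is stored as d three-way cores, and any single entry is the product of matrix slices of those cores. The tiny NumPy sketch below shows that evaluation; the mode sizes and TT ranks are illustrative, and the paper's distributed data layout is not represented.
Code sketch (Python):
import numpy as np

def tt_entry(cores, index):
    """Evaluate one entry of a tensor stored as TT cores G_k of shape (r_{k-1}, n_k, r_k)."""
    v = np.ones((1, 1))
    for G, i in zip(cores, index):
        v = v @ G[:, i, :]      # multiply the chain of matrix slices
    return v.item()             # boundary ranks r_0 = r_d = 1 leave a 1x1 result

# A random 4-way TT tensor with mode sizes (3, 4, 5, 6) and TT ranks (1, 2, 3, 2, 1).
rng = np.random.default_rng(4)
ranks, modes = [1, 2, 3, 2, 1], [3, 4, 5, 6]
cores = [rng.standard_normal((ranks[k], modes[k], ranks[k + 1])) for k in range(4)]
print(tt_entry(cores, (0, 1, 2, 3)))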

Abstract:
Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required to perform the Multi-TTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensor-times-matrix operations.
BibTeX:
@techreport{ABGKR22-TR-MTTM,
  author = {Al Daas, Hussam and Ballard, Grey and Grigori, Laura and Kumar, Suraj and Rouse, Kathryn},
  title = {Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation},
  year = {2022},
  number = {2207.10437},
  url = {https://arxiv.org/abs/2207.10437}
}
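
Multi-TTM multiplies an input tensor by a (possibly different) matrix in every mode, as when forming a Tucker core. The serial NumPy sketch below expresses the computation as a single contraction and checks it against the straightforward sequence of tensor-times-matrix operations mentioned in the abstract; the sizes are illustrative, and the paper's processor grid is not modeled.
Code sketch (Python):
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 40, 50))          # 3-way input tensor
A1 = rng.standard_normal((5, 30))              # matrix applied in mode 1
A2 = rng.standard_normal((6, 40))              # matrix applied in mode 2
A3 = rng.standard_normal((7, 50))              # matrix applied in mode 3

# Multi-TTM as one contraction: every tensor mode is contracted with its matrix.
Y = np.einsum('ijk,ai,bj,ck->abc', X, A1, A2, A3)
print(Y.shape)                                  # (5, 6, 7)

# The straightforward approach: one tensor-times-matrix operation at a time.
T = np.einsum('ai,ijk->ajk', A1, X)
T = np.einsum('bj,ajk->abk', A2, T)
Y_seq = np.einsum('ck,abk->abc', A3, T)
assert np.allclose(Y, Y_seq)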

Abstract:
Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices.
BibTeX:
@inproceedings{ABGKR22,
  author = {Al Daas, Hussam and Ballard, Grey and Grigori, Laura and Kumar, Suraj and Rouse, Kathryn},
  title = {Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds},
  booktitle = {Proceedings of the 34th Annual ACM Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {New York, NY, USA},
  year = {2022},
  series = {SPAA '22},
  url = {https://doi.org/10.1145/3490148.3538552}
}

Abstract:
Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices.
BibTeX:
@techreport{ABGKR22-TR,
  author = {Al Daas, Hussam and Ballard, Grey and Grigori, Laura and Kumar, Suraj and Rouse, Kathryn},
  title = {Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds},
  year = {2022},
  url = {https://arxiv.org/abs/2205.13407}
}

Abstract:
Tensor Train (TT) is a low-rank tensor representation consisting of a series of three-way cores whose dimensions specify the TT ranks. Formal tensor train arithmetic often causes an artificial increase in the TT ranks. Thus, a key operation for applications that use the TT format is rounding, which truncates the TT ranks subject to an approximation error guarantee. Truncation is performed via SVD of a highly structured matrix, and current rounding methods require careful orthogonalization to compute an accurate SVD. We propose a new algorithm for TT-Rounding based on the Gram SVD algorithm that avoids the expensive orthogonalization phase. Our algorithm performs less computation and can be parallelized more easily than existing approaches, at the expense of a slight loss of accuracy. We demonstrate that our implementation of the rounding algorithm is efficient, scales well, and consistently outperforms the existing state-of-the-art parallel implementation in our experiments.
BibTeX:
@inproceedings{ABM22,
  author = {Al Daas, Hussam and Ballard, Grey and Manning, Lawton},
  title = {Parallel {Tensor Train} Rounding using {Gram SVD}},
  booktitle = {Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium},
  year = {2022},
  series = {IPDPS '22},
  pages = {930--940},
  url = {https://doi.org/10.1109/IPDPS53621.2022.00095}
}
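
At the matrix level, the Gram SVD idea is to replace orthogonalization of A with an eigendecomposition of the much smaller Gram matrix A^T A, trading roughly half the attainable accuracy for less work. The NumPy sketch below shows that matrix-level kernel only; applying it to the structured TT unfoldings, and the parallel implementation, are not shown, and the truncation tolerance is arbitrary.
Code sketch (Python):
import numpy as np

def truncated_svd_via_gram(A, tol=1e-6):
    """Rank truncation of a tall matrix A using its Gram matrix (sketch of the Gram SVD idea)."""
    G = A.T @ A                              # small n x n Gram matrix
    w, V = np.linalg.eigh(G)                 # eigenvalues in ascending order
    w, V = w[::-1], V[:, ::-1]               # sort descending
    s = np.sqrt(np.maximum(w, 0.0))          # singular values of A
    r = int(np.sum(s > tol * s[0]))          # truncation rank
    V_r = V[:, :r]
    U_r = (A @ V_r) / s[:r]                  # recover the left singular vectors
    return U_r, s[:r], V_r.T

rng = np.random.default_rng(6)
A = rng.standard_normal((1000, 20)) @ rng.standard_normal((20, 40))  # rank 20, size 1000 x 40
U, s, Vt = truncated_svd_via_gram(A)
print(U.shape, np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))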

BibTeX:
@techreport{MLB22-TR,
  author = {Rachel Minster and Zitong Li and Grey Ballard},
  title = {Parallel Randomized Tucker Decomposition Algorithms},
  year = {2022},
  number = {2211.13028},
  url = {https://arxiv.org/abs/2211.13028}
}

Abstract:
The Tensor-Train (TT) format is a highly compact low-rank representation for high-dimensional tensors. TT is particularly useful when representing approximations to the solutions of certain types of parametrized partial differential equations. For many of these problems, computing the solution explicitly would require an infeasible amount of memory and computational time. While the TT format makes these problems tractable, iterative techniques for solving the PDEs must be adapted to perform arithmetic while maintaining the implicit structure. The fundamental operation used to maintain feasible memory and computational time is called rounding, which truncates the internal ranks of a tensor already in TT format. We propose several randomized algorithms for this task that are generalizations of randomized low-rank matrix approximation algorithms and provide significant reduction in computation compared to deterministic TT-rounding algorithms. Randomization is particularly effective in the case of rounding a sum of TT-tensors (where we observe 20x speedup), which is the bottleneck computation in the adaptation of GMRES to vectors in TT format. We present the randomized algorithms and compare their empirical accuracy and computational time with deterministic alternatives.
BibTeX:
@techreport{AB+21,
  author = {Al Daas, Hussam and Ballard, Grey and Cazeaux, Paul and Hallman, Eric and Miedlar, Agnieszka and Pasha, Mirjeta and Reid, Tim W. and Saibaba, Arvind K.},
  title = {Randomized algorithms for rounding in the Tensor-Train format},
  year = {2021},
  url = {https://arxiv.org/abs/2110.04393}
}

Abstract:
The design and analysis of parallel algorithms are both fundamental to the set of high-performance, parallel, and distributed computing skills required to use modern computing resources efficiently. In this work, we present an approach of teaching parallel computing within an undergraduate algorithms course that combines the paradigms of dynamic programming and multithreaded parallelization. We have developed a visualization tool built with the Thread Safe Graphics Library that enables interactive demonstration of parallelization techniques for two fundamental dynamic programming problems, 0/1 Knapsack and Longest Common Subsequence. We describe the implementation of the tool, the real-time animation it produces, and the results of using it in class. The tool is publicly available to be used directly or as a basis on which to build visualizations of other parallel dynamic programming algorithms.
BibTeX:
@inproceedings{BP21,
  author = {Grey Ballard and Sarah Parsons},
  title = {Visualizing Parallel Dynamic Programming using the Thread Safe Graphics Library},
  booktitle = {Proceedings of the IEEE/ACM Ninth Workshop on Education for High Performance Computing},
  year = {2021},
  series = {EduHPC '21},
  pages = {24-31},
  url = {https://doi.org/10.1109/EduHPC54835.2021.00009}
}
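
The parallelization pattern the tool visualizes for 0/1 Knapsack rests on the fact that each row of the DP table depends only on the previous row, so all capacities within a row can be computed concurrently. The serial Python sketch below shows that dependency structure; the visualization itself (built with the Thread Safe Graphics Library) and the multithreading are not reproduced here.
Code sketch (Python):
def knapsack(values, weights, capacity):
    """0/1 knapsack DP. Each row depends only on the previous one,
    so the inner loop over capacities is safe to parallelize."""
    prev = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Every entry of `curr` reads only `prev`: an embarrassingly parallel row update.
        curr = [max(prev[c], prev[c - w] + v) if c >= w else prev[c]
                for c in range(capacity + 1)]
        prev = curr
    return prev[capacity]

print(knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # 220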

Abstract:
Matrix multiplication is one of the bottleneck computations for training the weights within deep neural networks. To speed up the training phase, we propose to use faster algorithms for matrix multiplication known as Arbitrary Precision Approximating (APA) algorithms. APA algorithms perform asymptotically fewer arithmetic operations than the classical algorithm, but they compute an approximate result with an error that can be made arbitrarily small in exact arithmetic. Practical APA algorithms provide significant reduction in computation time and still provide enough accuracy for many applications like neural network training. We demonstrate that APA algorithms can be efficiently implemented and parallelized for multicore CPUs to obtain up to 28% and 21% speedups over the fastest implementation of the classical algorithm using one core and 12 cores, respectively. Furthermore, using these algorithms to train a Multi-Layer Perceptron (MLP) network yields no significant reduction in the training or testing error. Our performance results on a large MLP network show overall sequential and multithreaded performance improvements of up to 25% and 13%, respectively. We also demonstrate up to 15% improvement when training the fully connected layers of the VGG-19 image classification network.
BibTeX:
@inproceedings{BWZ21,
  author = {Grey Ballard and Jack Weissenberger and Luoping Zhang},
  title = {Accelerating Neural Network Training using Arbitrary Precision Approximating Matrix Multiplication Algorithms},
  booktitle = {50th International Conference on Parallel Processing Workshop},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {August},
  year = {2021},
  series = {ICPP Workshops '21},
  pages = {8},
  url = {https://doi.org/10.1145/3458744.3474050}
}

Abstract:
We consider the problem of low-rank approximation of massive dense nonnegative tensor data, for example, to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes, and performing efficient and scalable parallel algorithms to compute the low-rank approximation. We present a software package called Parallel Low-rank Approximation with Nonnegativity Constraints, which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm (e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs to accelerate the computation in this work). We describe our parallel distributions and algorithms, which are careful to avoid unnecessary communication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiency and scalability results for both synthetic and real-world data sets.
BibTeX:
@article{EH+21,
  author = {Eswar, Srinivas and Hayashi, Koby and Ballard, Grey and Kannan, Ramakrishnan and Matheson, Michael A. and Park, Haesun},
  title = {{PLANC}: Parallel Low-Rank Approximation with Nonnegativity Constraints},
  journal = {ACM Transactions on Mathematical Software},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {June},
  year = {2021},
  volume = {47},
  number = {3},
  url = {https://doi.org/10.1145/3432185}
}

Abstract:
Tucker decomposition is a low-rank tensor approximation that generalizes a truncated matrix singular value decomposition (SVD). Existing parallel software has shown that Tucker decomposition is particularly effective at compressing terabyte-sized multidimensional scientific simulation datasets, computing reduced representations that satisfy a specified approximation error. The general approach is to get a low-rank approximation of the input data by performing a sequence of matrix SVDs of tensor unfoldings, which tend to be short-fat matrices. In the existing approach, the SVD is performed by computing the eigendecomposition of the Gram matrix of the unfolding. This method sacrifices some numerical stability in exchange for lower computation costs and easier parallelization. We propose using a more numerically stable though more computationally expensive way to compute the SVD by preprocessing with a QR decomposition step and computing an SVD of only the small triangular factor. The more numerically stable approach allows us to achieve the same accuracy with half the working precision (for example, single rather than double precision). We demonstrate that our method scales as well as the existing approach, and the use of lower precision leads to an overall reduction in running time of up to a factor of 2 when using 10s to 1000s of processors. Using the same working precision, we are also able to compute Tucker decompositions with much smaller approximation error.
BibTeX:
@inproceedings{LFB21,
  author = {Zitong Li and Qiming Fang and Grey Ballard},
  title = {Parallel Tucker Decomposition with Numerically Accurate SVD},
  booktitle = {50th International Conference on Parallel Processing},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {August},
  year = {2021},
  series = {ICPP '21},
  pages = {11},
  url = {https://doi.org/10.1145/3472456.3472472}
}
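
The key kernel described above, at the matrix level: for a short-fat unfolding A, compute a QR decomposition of A^T and then an SVD of only the small triangular factor, rather than eigendecomposing the Gram matrix A A^T. The serial NumPy sketch below illustrates that step with arbitrary sizes; the distributed algorithm and the reduced-precision experiments are not represented.
Code sketch (Python):
import numpy as np

def left_singular_vectors_via_qr(A):
    """Left singular vectors/values of a short-fat matrix A via QR of A^T (sketch)."""
    Q, R = np.linalg.qr(A.T)       # A^T = Q R, so A = R^T Q^T
    U, s, Wt = np.linalg.svd(R.T)  # SVD of the small m x m triangular factor
    return U, s                    # U spans the column space of A; s are its singular values

rng = np.random.default_rng(7)
A = rng.standard_normal((10, 50000))           # short-fat unfolding
U, s = left_singular_vectors_via_qr(A)
assert np.allclose(s, np.linalg.svd(A, compute_uv=False))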

BibTeX:
@techreport{ABB20-TR,
  author = {Al Daas, Hussam and Ballard, Grey and Benner, Peter},
  title = {Parallel Algorithms for Tensor Train Arithmetic},
  year = {2020},
  number = {2011.06532},
  url = {https://arxiv.org/abs/2011.06532}
}

Abstract:
Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, which we contrast to most compression methods that are focused on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 and 6.7 terabyte data sets distributed across 100s of nodes (1,000s of MPI processes), achieving compression ratios between 100 and 200,000x, which equates to 99–99.999% compression (depending on the desired accuracy) in substantially less time than it would take to even read the same dataset from a parallel file system. Moreover, we show that our method also allows for reconstruction of partial or down-sampled data on a single node, without a parallel computer so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single down-sampled time step or computing summary statistics. The code is available at https://gitlab.com/tensors/TuckerMPI.
BibTeX:
@article{BKK20,
  author = {Ballard, Grey and Klinvex, Alicia and Kolda, Tamara G.},
  title = {{TuckerMPI}: A Parallel {C++/MPI} Software Package for Large-Scale Data Compression via the Tucker Tensor Decomposition},
  journal = {ACM Transactions on Mathematical Software},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {June},
  year = {2020},
  volume = {46},
  number = {2},
  url = {https://dl.acm.org/doi/10.1145/3378445}
}
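
The underlying computation is the truncated Tucker decomposition: an SVD of each mode's unfolding yields a factor matrix, and contracting the tensor with the factors yields the core. The serial NumPy sketch below shows a truncated HOSVD for a three-way tensor with illustrative sizes and ranks; TuckerMPI's data distribution, parallel kernels, and error-driven rank selection are not represented.
Code sketch (Python):
import numpy as np

def hosvd(X, ranks):
    """Truncated higher-order SVD of a 3-way tensor: a factor per mode from an SVD of that
    mode's unfolding, then one Multi-TTM contraction to form the core (serial sketch)."""
    factors = []
    for mode, r in enumerate(ranks):
        unfolding = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])
    core = np.einsum('ijk,ia,jb,kc->abc', X, *factors)   # multiply X by U_k^T in every mode
    return core, factors

rng = np.random.default_rng(8)
X = rng.standard_normal((30, 40, 50))
core, factors = hosvd(X, ranks=(5, 6, 7))
X_hat = np.einsum('abc,ia,jb,kc->ijk', core, *factors)   # reconstruction from core and factors
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))     # relative approximation error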

Abstract:
Our goal is to establish lower bounds on the communication required to perform the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) computation on a distributed-memory parallel machine. MTTKRP is the bottleneck computation within algorithms for computing the CP tensor decomposition, which is an approximation by a sum of rank-one tensors and frequently used in multidimensional data analysis. The main result of this paper is a communication lower bound that generalizes previous results, tightening the bound so that it is attainable even when the tensor dimensions vary (the tensor is not cubical) and when the number of processors is small relative to the tensor dimensions. The attainability of the bound proves that the algorithm that attains it, which is based on a block distribution of the tensor and communicating only factor matrices, is communication optimal. The proof technique utilizes an established inequality that relates computations to data access as well as a novel approach based on convex optimization.
BibTeX:
@inproceedings{BR20,
  author = {Grey Ballard and Kathryn Rouse},
  title = {General Memory-Independent Lower Bound for MTTKRP},
  booktitle = {Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing},
  month = {January},
  year = {2020},
  pages = {1-11},
  url = {https://epubs.siam.org/doi/abs/10.1137/1.9781611976137.1}
}
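
For reference, the MTTKRP operation itself for a three-way tensor is M = X_(1) (C ⊙ B), where ⊙ denotes the Khatri-Rao (columnwise Kronecker) product and X_(1) is the mode-1 unfolding. The NumPy sketch below forms it both ways, explicitly and as a fused contraction that never materializes the Khatri-Rao matrix; the sizes are illustrative, the unfolding uses NumPy's row-major ordering, and the paper's parallel block distribution is not shown.
Code sketch (Python):
import numpy as np

rng = np.random.default_rng(9)
I, J, K, R = 30, 40, 50, 8
X = rng.standard_normal((I, J, K))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# Explicit formulation: unfold X in mode 1 (row-major, so column index is j*K + k)
# and multiply by the matching Khatri-Rao matrix with KR[j*K + k, r] = B[j, r] * C[k, r].
KR = (B[:, None, :] * C[None, :, :]).reshape(J * K, R)
M_explicit = X.reshape(I, J * K) @ KR

# Fused contraction that never forms the (J*K)-row Khatri-Rao matrix.
M_fused = np.einsum('ijk,jr,kr->ir', X, B, C)

assert np.allclose(M_explicit, M_fused)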

BibTeX:
@techreport{DB20,
  author = {Karen Devine and Grey Ballard},
  title = {{GentenMPI}: Distributed Memory Sparse Tensor Decomposition},
  year = {2020},
  number = {SAND2020-8515},
  url = {https://www.osti.gov/biblio/1656940}
}

Abstract:
We develop the first distributed-memory parallel implementation of Symmetric Nonnegative Matrix Factorization (SymNMF), a key data analytics kernel for clustering and dimensionality reduction. Our implementation includes two different algorithms for SymNMF, which give comparable results in terms of time and accuracy. The first algorithm is a parallelization of an existing sequential approach that uses solvers for nonsymmetric NMF. The second algorithm is a novel approach based on the Gauss-Newton method. It exploits second-order information without incurring large computational and memory costs. We evaluate the scalability of our algorithms on the Summit system at Oak Ridge National Laboratory, scaling up to 128 nodes (4096 cores) with 70% efficiency. Additionally, we demonstrate our software on an image segmentation task.
BibTeX:
@inproceedings{EH+20,
  author = {S. Eswar and K. Hayashi and G. Ballard and R. Kannan and R. Vuduc and H. Park},
  title = {Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = {November},
  year = {2020},
  pages = {1041-1054},
  url = {https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00078}
}
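
SymNMF approximates a symmetric nonnegative similarity matrix A by H H^T with H >= 0, which is the standard formulation this paper parallelizes. The serial NumPy sketch below applies a simple projected gradient method to that objective on a toy clustered similarity matrix; it reproduces neither of the paper's two algorithms (ANLS-based and Gauss-Newton), and the step-size rule is a heuristic curvature estimate.
Code sketch (Python):
import numpy as np

def symnmf_pgd(A, r, iters=500, seed=0):
    """Projected gradient descent for min_{H >= 0} ||A - H H^T||_F^2 (serial sketch)."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], r)) * np.sqrt(A.mean() / r)
    for _ in range(iters):
        R = H @ H.T - A                                                  # residual
        L = 4 * (2 * np.linalg.norm(H, 2) ** 2 + np.linalg.norm(R, 2))   # local curvature estimate
        H = np.maximum(H - (4 * R @ H) / L, 0.0)                         # step of length 1/L, then project
    return H

# Example: a noisy block-structured similarity matrix with three clusters.
rng = np.random.default_rng(10)
Z = np.repeat(np.eye(3), 30, axis=0)          # cluster indicator matrix (90 x 3)
A = Z @ Z.T + 0.05 * rng.random((90, 90))
A = (A + A.T) / 2
H = symnmf_pgd(A, 3)
print(np.linalg.norm(A - H @ H.T) / np.linalg.norm(A))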

Abstract:
Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnegative data, either for computing a flat partitioning of a dataset or for determining a hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical clustering that uses a divide-and-conquer approach based on rank-two NMF to split a data set into two cohesive parts. Not only does this approach uncover more structure in the data than a flat NMF clustering, but also rank-two NMF can be computed more quickly than for general ranks, providing comparable overall time to solution. Our data distribution and parallelization strategies are designed to maintain computational load balance throughout the data-dependent hierarchy of computation while limiting interprocess communication, allowing the algorithm to scale to large dense and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit supercomputer), applying the hierarchical clustering approach to hyperspectral imaging and image classification data. Our algorithm for Rank-2 NMF scales perfectly on up to 1000s of cores and the entire hierarchical clustering method achieves 5.9x speedup scaling from 10 to 80 nodes on the 800 GB dataset.
BibTeX:
@inproceedings{MBKP20,
  author = {Lawton Manning and Grey Ballard and Ramakrishnan Kannan and Haesun Park},
  title = {Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization},
  booktitle = {Proceedings of the 27th IEEE International Conference on High Performance Computing},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = {December},
  year = {2020},
  pages = {141-150},
  url = {https://ieeexplore.ieee.org/document/9406651}
}

BibTeX:
@article{Ballard19,
  author = {Grey Ballard},
  title = {{CP} and {Tucker} Tensor Decompositions},
  journal = {SIAG/OPT Views and News},
  month = {January},
  year = {2019},
  volume = {27},
  number = {1},
  pages = {1--7},
  url = {http://wiki.siam.org/siag-op/images/siag-op/b/bd/ViewsAndNews-27-1.pdf}
}

Abstract:
We introduce a Generalized Randomized QR-decomposition that may be applied to arbitrary products of matrices and their inverses, without needing to explicitly compute the products or inverses. This factorization is a critical part of a communication-optimal spectral divide-and-conquer algorithm for the nonsymmetric eigenvalue problem. In this paper, we establish that this randomized QR-factorization satisfies the strong rank-revealing properties. We also formally prove its stability, making it suitable in applications. Finally, we present numerical experiments which demonstrate that our theoretical bounds capture the empirical behavior of the factorization.
BibTeX:
@techreport{BDDR19,
  author = {Grey Ballard and James Demmel and Ioana Dumitriu and Alexander Rusciano},
  title = {A Generalized Randomized Rank-Revealing Factorization},
  year = {2019},
  number = {1909.06524},
  url = {https://arxiv.org/abs/1909.06524}
}

Abstract:
Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, which we contrast to most compression methods that are focused on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 terabyte and 6.7 terabyte data sets distributed across 100s of nodes (1000s of MPI processes), achieving compression rates between 100-200,000x, which equates to 99-99.999% compression (depending on the desired accuracy) in substantially less time than it would take to even read the same dataset from a parallel filesystem. Moreover, we show that our method also allows for reconstruction of partial or down-sampled data on a single node, without a parallel computer so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single down-sampled time step or computing summary statistics.
BibTeX:
@techreport{BKK19-TR,
  author = {Grey Ballard and Alicia Klinvex and Tamara G. Kolda},
  title = {{TuckerMPI}: Efficient parallel software for {Tucker} decompositions of dense tensors},
  year = {2019},
  number = {1901.06043},
  url = {https://arxiv.org/abs/1901.06043}
}

Abstract:
We consider the problem of low-rank approximation of massive dense non-negative tensor data, for example to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes and performing efficient and scalable parallel algorithms to compute the low-rank approximation. We present a software package called PLANC (Parallel Low Rank Approximation with Non-negativity Constraints), which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm (e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs to accelerate the computation in this work). We describe our parallel distributions and algorithms, which are careful to avoid unnecessary communication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiency and scalability results for both synthetic and real-world data sets.
BibTeX:
@techreport{EH+19-TR,
  author = {Eswar, Srinivas and Hayashi, Koby and Ballard, Grey and Kannan, Ramakrishnan and Matheson, Michael A. and Park, Haesun},
  title = {{PLANC}: Parallel Low Rank Approximation with Non-negativity Constraints},
  year = {2019},
  number = {1909.01149},
  url = {https://arxiv.org/abs/1909.01149}
}

Abstract:
We consider the problem of joint three-dimensional (3D) localization and material classification of unresolved space debris using a multispectral rotating point spread function (RPSF). The use of RPSF allows one to estimate the 3D locations of point sources from their rotated images acquired by a single 2D sensor array, since the amount of rotation of each source image about its x, y location depends on its axial distance z. Using multispectral images, with one RPSF per spectral band, we are able not only to localize the 3D positions of the space debris but also classify their material composition. We propose a three-stage method for achieving joint localization and classification. In stage 1, we adopt an optimization scheme for localization in which the spectral signature of each material is assumed to be uniform, which significantly improves efficiency and yields better localization results than possible with a single spectral band. In stage 2, we estimate the spectral signature and refine the localization result via an alternating approach. We process classification in the final stage. Both Poisson noise and Gaussian noise models are considered, and the implementation of each is discussed. Numerical tests using multispectral data from NASA show the efficiency of our three-stage approach and illustrate the improvement of point source localization and spectral classification from using multiple bands over a single band.
BibTeX:
@article{WBPP19,
  author = {Chao Wang and Grey Ballard and Robert Plemmons and Sudhakar Prasad},
  title = {Joint {3D} localization and classification of space debris using a multispectral rotating point spread function},
  journal = {Applied Optics},
  publisher = {OSA},
  month = {Nov},
  year = {2019},
  volume = {58},
  number = {31},
  pages = {8598--8611},
  url = {https://pubmed.ncbi.nlm.nih.gov/31873353/}
}

Abstract:
Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
BibTeX:
@inproceedings{BDGJK18,
  author = {Grey Ballard and James Demmel and Laura Grigori and Mathias Jacquelin and Nicholas Knight},
  title = {A {3D} Parallel Algorithm for {QR} Decomposition},
  booktitle = {Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {New York, NY, USA},
  year = {2018},
  series = {SPAA '18},
  pages = {55--65},
  url = {http://doi.acm.org/10.1145/3210377.3210415}
}

Abstract:
Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
BibTeX:
@techreport{BDGJK18-TR,
  author = {Grey Ballard and James Demmel and Laura Grigori and Mathias Jacquelin and Nicholas Knight},
  title = {A 3D Parallel Algorithm for QR Decomposition},
  year = {2018},
  number = {1805.05278},
  url = {https://arxiv.org/abs/1805.05278}
}

Abstract:
The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensors that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.
BibTeX:
@inproceedings{BHK18,
  author = {Grey Ballard and Koby Hayashi and Ramakrishnan Kannan},
  title = {Parallel Nonnegative {CP} Decomposition of Dense Tensors},
  booktitle = {25th IEEE International Conference on High Performance Computing (HiPC)},
  month = {Dec},
  year = {2018},
  pages = {22-31},
  url = {https://ieeexplore.ieee.org/document/8638076}
}

Abstract:
The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the matricized-tensor times Khatri-Rao product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.
BibTeX:
@techreport{BHK18-TR,
  author = {Grey Ballard and Koby Hayashi and Ramakrishnan Kannan},
  title = {Parallel Nonnegative {CP} Decomposition of Dense Tensors},
  year = {2018},
  number = {1806.07985},
  url = {https://arxiv.org/abs/1806.07985}
}

Abstract:
This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank 23 decompositions for the 3 × 3 matrix multiplication tensor M_⟨ 3 ⟩. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and admits an unexpected symmetry group of order 12. We establish basic information regarding symmetry groups of decompositions and outline two approaches for finding new rank decompositions of M_⟨ n ⟩ for larger n.
BibTeX:
@article{BILR18,
  author = {Grey Ballard and Christian Ikenmeyer and J.M. Landsberg and Nick Ryder},
  title = {The geometry of rank decompositions of matrix multiplication II: 3x3 matrices},
  journal = {Journal of Pure and Applied Algebra},
  year = {2018},
  volume = {223},
  number = {8},
  pages = {3205 - 3224},
  url = {http://www.sciencedirect.com/science/article/pii/S0022404918302548}
}

Abstract:
This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank 23 decompositions for the 3 × 3 matrix multiplication tensor M_⟨ 3 ⟩. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and admits an unexpected symmetry group of order 12. We establish basic information regarding symmetry groups of decompositions and outline two approaches for finding new rank decompositions of M_⟨ n ⟩ for larger n.
BibTeX:
@techreport{BILR18-TR,
  author = {Grey Ballard and Christian Ikenmeyer and J.M. Landsberg and Nick Ryder},
  title = {The geometry of rank decompositions of matrix multiplication {II}: 3x3 matrices},
  year = {2018},
  number = {1801.00843},
  url = {https://arxiv.org/abs/1801.00843}
}

Abstract:
The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.
BibTeX:
@inproceedings{BKR18,
  author = {Grey Ballard and Nicholas Knight and Kathryn Rouse},
  title = {Communication Lower Bounds for Matricized Tensor Times {Khatri-Rao} Product},
  booktitle = {Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium},
  month = {May},
  year = {2018},
  pages = {557-567},
  url = {https://ieeexplore.ieee.org/document/8425209/}
}

Abstract:
The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show that the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri-Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization.
BibTeX:
@article{BBK18,
  author = {Casey Battaglino and Grey Ballard and Tamara G. Kolda},
  title = {A Practical Randomized {CP} Tensor Decomposition},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  year = {2018},
  volume = {39},
  number = {2},
  pages = {876-901},
  url = {https://doi.org/10.1137/17M1112303}
}

Abstract:
The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this work, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4X over existing parallel software.
BibTeX:
@inproceedings{HBJT18,
  author = {Hayashi, Koby and Ballard, Grey and Jiang, Yujie and Tobia, Michael J.},
  title = {Extended Abstract: Shared-memory Parallelization of {MTTKRP} for Dense Tensors},
  booktitle = {Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  address = {New York, NY, USA},
  year = {2018},
  series = {PPoPP '18},
  pages = {393--394},
  url = {http://doi.acm.org/10.1145/3178487.3178522}
}

Abstract:
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that WH approximates A. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of sizes that span from a few hundred million to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.
BibTeX:
@article{KBP17,
  author = {Ramakrishnan Kannan and Grey Ballard and Haesun Park},
  title = {{MPI-FAUN}: An {MPI}-Based Framework for Alternating-Updating Nonnegative Matrix Factorization},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  month = {March},
  year = {2018},
  volume = {30},
  number = {3},
  pages = {544-558},
  url = {https://www.computer.org/csdl/trans/tk/2018/03/08089433-abs.html}
}

Abstract:
Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and unmixing in microscopy. In this paper, we focus on scaling algorithms for NMF to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. We consider two previous works developed for related problems, one that uses a fine-grained partitioning strategy using a point-to-point communication pattern and one that uses a Cartesian, or checkerboard, partitioning strategy using a collective-based communication pattern. We show that a combination of the previous approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. From the experiments, we see that our proposed strategy runs up to 10x faster than the state of the art on real-world datasets.
BibTeX:
@inproceedings{KKB18,
  author = {Oguz Kaya and Ramakrishnan Kannan and Grey Ballard},
  title = {Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization},
  booktitle = {Proceedings of the 47th International Conference on Parallel Processing},
  publisher = {ACM},
  address = {New York, NY, USA},
  year = {2018},
  series = {ICPP 2018},
  pages = {90:1--90:10},
  url = {https://doi.org/10.1145/3225058.3225127}
}

Abstract:
There is a growing interest in using so-called dynamic functional connectivity, as the conventional static brain connectivity models are being questioned. Brain network analyses yield complex network data that are difficult to analyze and interpret. To deal with the complex structures, decomposition/factorization techniques that simplify the data are often used. For dynamic network analyses, data simplification is of even greater importance, as dynamic connectivity analyses result in a time series of complex networks. A new challenge that must be faced when using these decomposition/factorization techniques is how to interpret the resulting connectivity patterns. Connectivity patterns resulting from decomposition analyses are often visualized as networks in brain space, in the same way that pairwise correlation networks are visualized. This elevates the risk of conflating connections between nodes that represent correlations between nodes' time series with connections between nodes that result from decomposition analyses. Moreover, dynamic connectivity data may be represented with three-dimensional or four-dimensional (4D) tensors and decomposition results require unique interpretations. Thus, the primary goal of this article is to (1) address the issues that must be considered when interpreting the connectivity patterns from decomposition techniques and (2) show how the data structure and decomposition method interact to affect this interpretation. The outcome of our analyses is summarized as follows. (1) The edge strength in decomposition connectivity patterns represents complex relationships not pairwise interactions between the nodes. (2) The structure of the data significantly alters the connectivity patterns, for example, 4D data result in connectivity patterns with higher regional connections. (3) Orthogonal decomposition methods outperform in feature reduction applications, whereas nonorthogonal decomposition methods are better for mechanistic interpretation.
BibTeX:
@article{MLRB18,
  author = {Mokhtari, Fatemeh and Laurienti, Paul J and Rejeski, W Jack and Ballard, Grey},
  title = {Dynamic Functional Magnetic Resonance Imaging Connectivity Tensor Decomposition: A New Approach to Analyze and Interpret Dynamic Brain Connectivity},
  journal = {Brain connectivity},
  month = {December},
  year = {2018},
  url = {https://doi.org/10.1089/brain.2018.0605}
}

Abstract:
The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.
BibTeX:
@techreport{BKR17-TR,
  author = {Grey Ballard and Nicholas Knight and Kathryn Rouse},
  title = {Communication Lower Bounds for Matricized Tensor Times {Khatri-Rao} Product},
  year = {2017},
  number = {1708.07401},
  url = {https://arxiv.org/abs/1708.07401}
}

Abstract:
The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri-Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization.
BibTeX:
@techreport{BBK17-TR,
  author = {Casey Battaglino and Grey Ballard and Tamara G. Kolda},
  title = {A Practical Randomized {CP} Tensor Decomposition},
  year = {2017},
  number = {1701.06600},
  url = {https://arxiv.org/abs/1701.06600}
}

Abstract:
The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4x over existing parallel software.
BibTeX:
@techreport{HBJT17-TR,
  author = {Koby Hayashi and Grey Ballard and Jeffrey Jiang and Michael J. Tobia},
  title = {Shared Memory Parallelization of MTTKRP for Dense Tensors},
  year = {2017},
  number = {1708.08976},
  url = {https://arxiv.org/abs/1708.08976}
}

Abstract:
Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires Θ(√c) less interprocessor communication than previously known algorithms, for any c ≤ p^(1/3) when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O(log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.
BibTeX:
@inproceedings{SBDH17,
  author = {Solomonik, Edgar and Ballard, Grey and Demmel, James and Hoefler, Torsten},
  title = {A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem},
  booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {New York, NY, USA},
  year = {2017},
  series = {SPAA '17},
  pages = {111--121},
  url = {http://doi.acm.org/10.1145/3087556.3087561}
}

Abstract:
Exposure to acute stress induces multiple emotional responses, each with their own unique temporal dynamics. Dynamic functional connectivity (dFC) measures the temporal variability of network synchrony and captures individual differences in network neurodynamics. This study investigated the relationship between dFC and individual differences in emotions induced by an acute psychosocial stressor. Sixteen healthy adult women underwent fMRI scanning during a social evaluative threat (SET) task, and retrospectively completed questionnaires that assessed individual differences in subjectively experienced positive and negative emotions about stress and stress relief during the task. Group dFC was decomposed with parallel factor analysis (PARAFAC) into 10 components, each with a temporal signature, spatial network of functionally connected regions, and vector of participant loadings that captures individual differences in dFC. Participant loadings of two networks were positively correlated with stress-related emotions, indicating the existence of networks for positive and negative emotions. The emotion-related networks involved the ventromedial prefrontal cortex, cingulate cortex, anterior insula, and amygdala, among other distributed brain regions, and time signatures for these emotion-related networks were uncorrelated. These findings demonstrate that individual differences in stress-induced positive and negative emotions are each uniquely associated with large-scale brain networks, and suggest that dFC is a mechanism that generates individual differences in the emotional components of the stress response.
BibTeX:
@article{THBGW17,
  author = {Tobia, Michael J. and Hayashi, Koby and Ballard, Grey and Gotlib, Ian H. and Waugh, Christian E.},
  title = {Dynamic functional connectivity and individual differences in emotions during social stress},
  journal = {Human Brain Mapping},
  year = {2017},
  volume = {38},
  number = {12},
  pages = {6185--6205},
  url = {http://dx.doi.org/10.1002/hbm.23821}
}

Abstract:
As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
BibTeX:
@inproceedings{ABK16,
  author = {Woody Austin and Grey Ballard and Tamara G. Kolda},
  title = {Parallel Tensor Compression for Large-Scale Scientific Data},
  booktitle = {Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium},
  month = {May},
  year = {2016},
  pages = {912--922},
  url = {https://www.computer.org/csdl/proceedings/ipdps/2016/2140/00/2140a912-abs.html}
}
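
As a point of reference, the following is a small sequential sketch of a truncated HOSVD, one common way to compute a Tucker approximation: each factor matrix comes from the leading left singular vectors of a mode unfolding, and the core is the tensor contracted with the transposed factors. The distributed-memory data layout and communication analysis that are the subject of the paper are not represented here; the helper names and the low-rank-plus-noise test tensor are assumptions of the sketch.

Code sketch (Python):

import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: rows indexed by mode `mode`, columns by the remaining modes."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def truncated_hosvd(X, ranks):
    """Tucker approximation via truncated HOSVD (sequential, for illustration only)."""
    factors = []
    for n, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :r])                     # leading left singular vectors
    core = X
    for n, U in enumerate(factors):                  # core = X x_1 U1^T x_2 U2^T x_3 U3^T
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, n)), 0, n)
    return core, factors

def reconstruct(core, factors):
    X = core
    for n, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)
    return X

rng = np.random.default_rng(2)
G = rng.standard_normal((5, 5, 5))                   # small core
Us = [np.linalg.qr(rng.standard_normal((40, 5)))[0] for _ in range(3)]
X = reconstruct(G, Us) + 1e-3 * rng.standard_normal((40, 40, 40))

core, factors = truncated_hosvd(X, ranks=(5, 5, 5))
Xhat = reconstruct(core, factors)
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
compression = X.size / (core.size + sum(U.size for U in factors))
print(f"relative error {rel_err:.2e}, compression ratio {compression:.1f}x")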

Abstract:
Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos--Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
BibTeX:
@article{AB+16,
  author = {Ariful Azad and Grey Ballard and Aydin Bulu\c{c} and James Demmel and Laura Grigori and Oded Schwartz and Sivan Toledo and Samuel Williams},
  title = {Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication},
  journal = {SIAM Journal on Scientific Computing},
  year = {2016},
  volume = {38},
  number = {6},
  pages = {C624-C651},
  url = {http://dx.doi.org/10.1137/15M104253X}
}

Abstract:
Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen's original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen's algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than those of the classical algorithm, they are not used even in cases where they provide a performance benefit. We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy.
BibTeX:
@article{BBDLS16,
  author = {Grey Ballard and Austin R. Benson and Alex Druinsky and Benjamin Lipshitz and Oded Schwartz},
  title = {Improving the Numerical Stability of Fast Matrix Multiplication},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  year = {2016},
  volume = {37},
  number = {4},
  pages = {1382-1418},
  url = {http://dx.doi.org/10.1137/15M1032168}
}
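
For readers who have not seen a fast algorithm written out, here is one level of Strassen's recursion (seven multiplications of half-size blocks instead of eight), which is representative of the algorithms whose error behavior the paper analyzes. The sketch stops after one level and assumes an even matrix dimension; the diagonal-scaling techniques mentioned in the abstract are not shown.

Code sketch (Python):

import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's algorithm (n even), then classical multiplication."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

rng = np.random.default_rng(3)
A = rng.standard_normal((512, 512))
B = rng.standard_normal((512, 512))
C_classical = A @ B
C_fast = strassen_one_level(A, B)
# The fast result is close, but its worst-case error bound is slightly weaker.
print(np.max(np.abs(C_fast - C_classical)) / np.max(np.abs(C_classical)))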

Abstract:
Network topologies can have a significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of the network and the computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds which only specify the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks. On the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms only in very few cases, leaving many open problems for network and algorithmic design.
BibTeX:
@inproceedings{BD+16,
  author = {Ballard, Grey and Demmel, James and Gearhart, Andrew and Lipshitz, Benjamin and Oltchik, Yishai and Schwartz, Oded and Toledo, Sivan},
  title = {Network Topologies and Inevitable Contention},
  booktitle = {Proceedings of the First Workshop on Optimization of Communication in HPC},
  publisher = {IEEE Press},
  address = {Piscataway, NJ, USA},
  year = {2016},
  series = {COM-HPC '16},
  pages = {39--52},
  url = {http://dl.acm.org/citation.cfm?id=3018063}
}

Abstract:
We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is nonzero structure dependent, meaning that we seek the best algorithm for the given input matrices.

In addition to our three-dimensional fine-grained model, we also propose coarse-grained one-dimensional and two-dimensional models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms.

Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.

BibTeX:
@article{BDKS16,
  author = {Ballard, Grey and Druinsky, Alex and Knight, Nicholas and Schwartz, Oded},
  title = {Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication},
  journal = {ACM Transactions on Parallel Computing},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {December},
  year = {2016},
  volume = {3},
  number = {3},
  pages = {18:1--18:34},
  url = {http://doi.acm.org/10.1145/3015144}
}
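
The sketch below builds what I read to be the fine-grained hypergraph described above, under the assumption that its vertices are the nonzero scalar multiplications A[i,k]*B[k,j] and that each nonzero of A, B, and C contributes one net connecting the multiplications that touch it. Partitioning this hypergraph (with an external tool such as PaToH or hMETIS) is the step the paper equates with choosing a communication-efficient SpGEMM algorithm; that step is not shown here.

Code sketch (Python):

import numpy as np
from collections import defaultdict
from scipy.sparse import random as sparse_random

def fine_grained_hypergraph(A, B):
    """Vertices: nonzero scalar multiplications A[i,k]*B[k,j].
    Nets: one per nonzero of A, of B, and of C, connecting the multiplications
    that read that nonzero of A or B, or contribute to that nonzero of C."""
    A = A.tocsr(); B = B.tocsr()
    vertices = []                      # vertex id -> (i, k, j)
    nets = defaultdict(list)           # net key -> list of vertex ids
    for i in range(A.shape[0]):
        for k in A.indices[A.indptr[i]:A.indptr[i + 1]]:
            for j in B.indices[B.indptr[k]:B.indptr[k + 1]]:
                v = len(vertices)
                vertices.append((i, k, j))
                nets[('A', i, k)].append(v)
                nets[('B', k, j)].append(v)
                nets[('C', i, j)].append(v)
    return vertices, nets

A = sparse_random(60, 60, density=0.05, format='csr', random_state=0)
B = sparse_random(60, 60, density=0.05, format='csr', random_state=1)
vertices, nets = fine_grained_hypergraph(A, B)
print(len(vertices), "multiplications,", len(nets), "nets")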

Abstract:
We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is sparsity dependent, meaning that we seek the best algorithm for the given input matrices.

In addition to our (3D) fine-grained model, we also propose coarse-grained 1D and 2D models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms.

Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.

BibTeX:
@techreport{BDKS16-TR,
  author = {Grey Ballard and Alex Druinsky and Nicholas Knight and Oded Schwartz},
  title = {Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication},
  month = {March},
  year = {2016},
  number = {1603.05627},
  url = {http://arxiv.org/abs/1603.05627}
}

Abstract:
We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient one for all of the matrix-matrix multiplications involved. By using an alternative algorithm, we show that the communication costs are reduced (in theory and practice), and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that we can reduce communication by a factor of up to 5.4 for a model problem, and we observe in our empirical evaluation communication reductions of factors up to 4.7 for structured problems and 3.7 for unstructured problems. These reductions in communication translate to run-time speedups of factors up to 2.8 and 2.5, respectively.
BibTeX:
@article{BSH16,
  author = {Ballard, Grey and Siefert, Christopher and Hu, Jonathan},
  title = {Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid},
  journal = {SIAM Journal on Scientific Computing},
  month = {June},
  year = {2016},
  volume = {38},
  number = {3},
  pages = {C203-C231},
  url = {http://epubs.siam.org/doi/abs/10.1137/15M1028807}
}
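
The central computation in the abstract is the Galerkin triple product that forms the coarse-grid operator during AMG setup. The toy sketch below forms it with SciPy sparse matrices for a 1-D Poisson operator and a piecewise-constant aggregation prolongator; real setups use smoothed aggregation and distributed matrices, and the paper's question of which parallel SpGEMM algorithm to use for each product does not arise in this sequential toy.

Code sketch (Python):

import numpy as np
import scipy.sparse as sp

# 1-D Poisson matrix (tridiagonal) as a stand-in fine-grid operator
n = 1024
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')

# Toy piecewise-constant aggregation prolongator: pairs of fine points
# map to one coarse point (real AMG setups use smoothed aggregation, etc.)
nc = n // 2
rows = np.arange(n)
cols = rows // 2
P = sp.csr_matrix((np.ones(n), (rows, cols)), shape=(n, nc))

# Galerkin coarse-grid operator: a sequence of sparse matrix-matrix products.
# The paper's question is which parallel SpGEMM algorithm to use for each one.
AP = A @ P
Ac = P.T @ AP
print(Ac.shape, Ac.nnz)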

Abstract:
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that WH approximates A. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets.

We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for both dense and sparse matrices, and (2) it allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors W and H within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.

BibTeX:
@inproceedings{KBP16,
  author = {Kannan, Ramakrishnan and Ballard, Grey and Park, Haesun},
  title = {A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization},
  booktitle = {Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {February},
  year = {2016},
  series = {PPoPP '16},
  pages = {9:1--9:11},
  url = {http://doi.acm.org/10.1145/2851141.2851152}
}
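
The alternating NLS framework in the abstract reduces, per iteration, to two nonnegative least squares solves: one for H with W fixed and one for W with H fixed. The sketch below is a sequential, column-by-column version using scipy.optimize.nnls, intended only to show the structure of the iteration; the data distribution, MPI communication, and choice among update rules that the paper addresses are not represented.

Code sketch (Python):

import numpy as np
from scipy.optimize import nnls

def anls_nmf(A, r, iters=20, seed=0):
    """Rank-r NMF of A (m x n) by alternating nonnegative least squares.
    Sequential illustration only; each subproblem is solved column by column."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        # Fix W, solve min_{H >= 0} ||A - W H||_F, one column of H at a time
        H = np.column_stack([nnls(W, A[:, j])[0] for j in range(n)])
        # Fix H, solve min_{W >= 0} ||A - W H||_F, one row of W at a time
        W = np.vstack([nnls(H.T, A[i, :])[0] for i in range(m)])
    return W, H

rng = np.random.default_rng(1)
A = rng.random((50, 5)) @ rng.random((5, 40))   # nonnegative, (near) rank 5
W, H = anls_nmf(A, r=5)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))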

Abstract:
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A is approximated by WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets.
The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices whose sizes span from a few hundred million to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.
BibTeX:
@techreport{KBP16-TR,
  author = {Ramakrishnan Kannan and Grey Ballard and Haesun Park},
  title = {MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization},
  month = {September},
  year = {2016},
  number = {1609.09154},
  url = {https://arxiv.org/abs/1609.09154}
}

Abstract:
Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires Θ(√c) less interprocessor communication than previously known algorithms, for any c ≤ p^1/3 when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O(log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.
BibTeX:
@techreport{SBDH16-TR,
  author = {Solomonik, Edgar and Ballard, Grey and Demmel, James and Hoefler, Torsten},
  title = {A communication-avoiding parallel algorithm for the symmetric eigenvalue problem},
  month = {April},
  year = {2016},
  number = {1604.03703},
  url = {http://arxiv.org/abs/1604.03703}
}

Abstract:
As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
BibTeX:
@techreport{ABK15-TR,
  author = {Woody Austin and Grey Ballard and Tamara G. Kolda},
  title = {Parallel Tensor Compression for Large-Scale Scientific Data},
  year = {2015},
  number = {1510.06689},
  url = {http://arxiv.org/abs/1510.06689}
}

Abstract:
Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first-ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
BibTeX:
@techreport{AB+15-TR,
  author = {Ariful Azad and Grey Ballard and Aydin Buluc and James Demmel and Laura Grigori and Oded Schwartz and Sivan Toledo and Samuel Williams},
  title = {Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication},
  month = {October},
  year = {2015},
  number = {1510.00844},
  url = {http://arxiv.org/abs/1510.00844}
}

Abstract:
Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen's original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen's algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than those of the classical algorithm, they are not used even in cases where they provide a performance benefit.

We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy.

BibTeX:
@techreport{BBDLS15-TR,
  author = {Grey Ballard and Austin R. Benson and Alex Druinsky and Benjamin Lipshitz and Oded Schwartz},
  title = {Improving the numerical stability of fast matrix multiplication algorithms},
  month = {July},
  year = {2015},
  number = {1507.00687},
  url = {http://arxiv.org/abs/1507.00687}
}

Abstract:
The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation.

We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. Experiments on supercomputers demonstrate the benefits of the communication cost improvements: in particular, our experiments show substantial improvements over tuned library implementations for tall-and-skinny matrices. We also provide algorithmic improvements to the Householder QR and CAQR algorithms, and we investigate several alternatives to the Householder reconstruction algorithm that sacrifice guarantees on numerical stability in some cases in order to obtain higher performance.

BibTeX:
@article{BDGJ+15,
  author = {Ballard, Grey and Demmel, James and Grigori, Laura and Knight, Nicholas and Jacquelin, Mathias and Nguyen, Hong Diep},
  title = {Reconstructing {H}ouseholder Vectors from Tall-Skinny {QR}},
  journal = {Journal of Parallel and Distributed Computing},
  month = {August},
  year = {2015},
  volume = {85},
  pages = {3-31},
  url = {http://www.sciencedirect.com/science/article/pii/S074373151500101X}
}
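
As background for the Householder-reconstruction idea, the sketch below is a one-level TSQR: factor each row block independently, stack the small R factors, and factor the stack; multiplying each local Q by the corresponding block of the second-level Q gives an orthogonal factor for the whole matrix. Reconstructing the Householder vector representation from such a tree, which is the paper's contribution, is not shown, and the block count and matrix sizes here are arbitrary.

Code sketch (Python):

import numpy as np

def tsqr_one_level(A, num_blocks=4):
    """QR of a tall-skinny matrix via one reduction level over row blocks."""
    blocks = np.array_split(A, num_blocks, axis=0)
    local = [np.linalg.qr(Ai) for Ai in blocks]              # (Q_i, R_i) per block
    Rstack = np.vstack([R for _, R in local])
    Q2, R = np.linalg.qr(Rstack)                             # combine the R factors
    Q2_blocks = np.split(Q2, num_blocks, axis=0)             # each block is n x n
    Q = np.vstack([Qi @ Q2i for (Qi, _), Q2i in zip(local, Q2_blocks)])
    return Q, R

rng = np.random.default_rng(4)
A = rng.standard_normal((4000, 20))                          # many more rows than columns
Q, R = tsqr_one_level(A)
print(np.linalg.norm(Q @ R - A) / np.linalg.norm(A),         # residual
      np.linalg.norm(Q.T @ Q - np.eye(20)))                  # orthogonality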

Abstract:
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present sequential and distributed-memory parallel algorithms for tridiagonalizing full symmetric and symmetric band matrices that asymptotically reduce communication compared to previous approaches. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve structure, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality and high parallel latency cost. We improve both by reorganizing the computation and obtain asymptotic improvements. We also propose new algorithms for reducing a full symmetric matrix to band form in a communication-efficient manner. In this article, we consider the cases of computing eigenvalues only and of computing eigenvalues and all eigenvectors.
BibTeX:
@article{BDK15,
  author = {Ballard, Grey and Demmel, James and Knight, Nicholas},
  title = {Avoiding Communication in Successive Band Reduction},
  journal = {ACM Transactions on Parallel Computing},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {February},
  year = {2015},
  volume = {1},
  number = {2},
  pages = {11:1--11:37},
  url = {http://doi.acm.org/10.1145/2686877}
}

Abstract:
The performance of parallel algorithms for sparse matrix-matrix multiplication is typically determined by the amount of interprocessor communication performed, which in turn depends on the nonzero structure of the input matrices. In this paper, we characterize the communication cost of a sparse matrix-matrix multiplication algorithm in terms of the size of a cut of an associated hypergraph that encodes the computation for a given input nonzero structure. Obtaining an optimal algorithm corresponds to solving a hypergraph partitioning problem. Our hypergraph model generalizes several existing models for sparse matrix-vector multiplication, and we can leverage hypergraph partitioners developed for that computation to improve application-specific algorithms for multiplying sparse matrices.
BibTeX:
@inproceedings{BDKS15,
  author = {Ballard, Grey and Druinsky, Alex and Knight, Nicholas and Schwartz, Oded},
  title = {Brief Announcement: Hypergraph Partitioning for Parallel Sparse Matrix-Matrix Multiplication},
  booktitle = {Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {June},
  year = {2015},
  series = {SPAA '15},
  pages = {86--88},
  url = {http://doi.acm.org/10.1145/2755573.2755613}
}

Awarded the Best Paper prize.

Abstract:
Given two sets of vectors A and B, our problem is to find the top-t dot products among all possible pairs where one vector is taken from each set. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i,j) is proportional to the square of the dot product between vectors i (from A) and j (from B), amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.
BibTeX:
@inproceedings{BKPS15,
  author = {Grey Ballard and Tamara G. Kolda and Ali Pinar and C. Seshadhri},
  title = {Diamond Sampling for Approximate Maximum All-pairs Dot-product {(MAD)} Search},
  booktitle = {15th IEEE International Conference on Data Mining},
  publisher = {IEEE Computer Society},
  month = {November},
  year = {2015},
  series = {ICDM '15},
  pages = {11--20},
  url = {http://dx.doi.org/10.1109/ICDM.2015.46}
}
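
For contrast with the sampling approach, here is the direct computation that diamond sampling is designed to avoid: form all m*n dot products and select the top t. The row-vector convention and the use of argpartition are choices of this sketch, not of the paper.

Code sketch (Python):

import numpy as np

def top_t_dot_products_direct(A, B, t):
    """Direct computation: all m*n dot products, then select the top t.
    Vectors are the rows of A (m x d) and B (n x d)."""
    C = A @ B.T                                    # all pairwise dot products
    flat = np.argpartition(C, -t, axis=None)[-t:]  # indices of the t largest
    flat = flat[np.argsort(-C.ravel()[flat])]      # sort them in descending order
    return [(i, j, C[i, j]) for i, j in zip(*np.unravel_index(flat, C.shape))]

rng = np.random.default_rng(5)
A = rng.standard_normal((1000, 32))
B = rng.standard_normal((800, 32))
for i, j, val in top_t_dot_products_direct(A, B, t=5):
    print(i, j, round(val, 3))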

Abstract:
Given two sets of vectors A and B, our problem is to find the top-t dot products among all possible pairs where one vector is taken from each set. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i,j) is proportional to the square of the dot product between vectors i (from A) and j (from B), amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.
BibTeX:
@techreport{BKPS15-TR,
  author = {Grey Ballard and Tamara G. Kolda and Ali Pinar and C. Seshadhri},
  title = {Diamond Sampling for Approximate Maximum All-pairs Dot-product {(MAD)} Search},
  month = {June},
  year = {2015},
  url = {http://arxiv.org/abs/1506.03872}
}

Abstract:
We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient one for all of the matrix-matrix multiplications involved. By using an alternative algorithm, we show that the communication costs are reduced (in theory and practice), and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that we can reduce communication by a factor of up to 5.4 for a model problem, and we observe in our empirical evaluation communication reductions of factors up to 4.7 for structured problems and 3.7 for unstructured problems. These reductions in communication translate to run-time speedups of up to factors of 2.3 and 2.5, respectively.
BibTeX:
@techreport{BSH15-TR,
  author = {Ballard, Grey and Siefert, Christopher and Hu, Jonathan},
  title = {Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid},
  month = {September},
  year = {2015},
  number = {SAND2015-3275},
  url = {http://prod.sandia.gov/techlib/access-control.cgi/2015/153275.pdf}
}

Abstract:
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
BibTeX:
@inproceedings{BB15,
  author = {Benson, Austin R. and Ballard, Grey},
  title = {A Framework for Practical Parallel Fast Matrix Multiplication},
  booktitle = {Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {February},
  year = {2015},
  series = {PPoPP 2015},
  pages = {42--53},
  url = {http://doi.acm.org/10.1145/2688500.2688513}
}

Abstract:
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A is approximated by WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices.

We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for both dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.

BibTeX:
@techreport{KBP15-TR,
  author = {Ramakrishnan Kannan and Grey Ballard and Haesun Park},
  title = {A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization},
  month = {September},
  year = {2015},
  number = {1509.09313},
  url = {http://arxiv.org/abs/1509.09313}
}

BibTeX:
@article{BB14,
  author = {Grey Ballard and John Baxley},
  title = {Asymptotic Behavior of the Eigenvalues of Toeplitz Integral Operators Associated with the Hankel Transform},
  journal = {Dynamic Systems and Applications},
  month = {December},
  year = {2014},
  volume = {23},
  number = {4},
  pages = {505--530}
}

Abstract:
We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen's triangular tridiagonalization. It factors a dense symmetric matrix A as the product A=PLTL^TP^T, where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.
BibTeX:
@article{BBDD+14,
  author = {Ballard, G. and Becker, D. and Demmel, J. and Dongarra, J. and Druinsky, A. and Peled, I. and Schwartz, O. and Toledo, S. and Yamazaki, I.},
  title = {Communication-Avoiding Symmetric-Indefinite Factorization},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  month = {November},
  year = {2014},
  volume = {35},
  number = {4},
  pages = {1364--1406},
  url = {http://dx.doi.org/10.1137/130929060}
}

Invited paper.

Abstract:
The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n^3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.
BibTeX:
@article{BCDH+14,
  author = {Ballard, G. and Carson, E. and Demmel, J. and Hoemmen, M. and Knight, N. and Schwartz, O.},
  title = {Communication lower bounds and optimal algorithms for numerical linear algebra},
  journal = {Acta Numerica},
  month = {May},
  year = {2014},
  volume = {23},
  pages = {1--155},
  url = {http://journals.cambridge.org/article_S0962492914000038}
}

Abstract:
The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms.
BibTeX:
@inproceedings{BDGJ+14,
  author = {Ballard, Grey and Demmel, James and Grigori, Laura and Jacquelin, Mathias and Nguyen, Hong Diep and Solomonik, Edgar},
  title = {Reconstructing {H}ouseholder Vectors from Tall-Skinny {QR}},
  booktitle = {IEEE 28th International Parallel and Distributed Processing Symposium},
  month = {May},
  year = {2014},
  pages = {1159--1170},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6877344}
}

Invited to appear as Research Highlight.

Abstract:
Algorithms have historically been evaluated in terms of the number of arithmetic operations they performed. This analysis is no longer sufficient for predicting running times on today's machines. Moving data through memory hierarchies and among processors requires much more time (and energy) than performing computations. Hardware trends suggest that the relative costs of this communication will only increase. Proving lower bounds on the communication of algorithms and finding algorithms that attain these bounds are therefore fundamental goals. We show that the communication cost of an algorithm is closely related to the graph expansion properties of its corresponding computation graph. Matrix multiplication is one of the most fundamental problems in scientific computing and in parallel computing. Applying expansion analysis to Strassen's and other fast matrix multiplication algorithms, we obtain the first lower bounds on their communication costs. These bounds show that the current sequential algorithms are optimal but that previous parallel algorithms communicate more than necessary. Our new parallelization of Strassen's algorithm is communication-optimal and outperforms all previous matrix multiplication algorithms.
BibTeX:
@article{BDHS14,
  author = {Ballard, Grey and Demmel, James and Holtz, Olga and Schwartz, Oded},
  title = {Communication Costs of {Strassen's} Matrix Multiplication},
  journal = {Communications of the ACM},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {February},
  year = {2014},
  volume = {57},
  number = {2},
  pages = {107--114},
  url = {http://doi.acm.org/10.1145/2556647.2556660}
}

Abstract:
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
BibTeX:
@techreport{BB14-TR,
  author = {Benson, Austin and Ballard, Grey},
  title = {A Framework for Practical Parallel Fast Matrix Multiplication},
  month = {September},
  year = {2014},
  url = {http://arxiv.org/abs/1409.2908}
}

Awarded the 2013 ACM Doctoral Dissertation Award - Honorable Mention.

Abstract:
Dense linear algebra computations are essential to nearly every problem in scientific computing and to countless other fields. Most matrix computations enjoy a high computational intensity (i.e., ratio of computation to data), and therefore the algorithms for the computations have a potential for high efficiency. However, performance for many linear algebra algorithms is limited by the cost of moving data between processors on a parallel computer or throughout the memory hierarchy of a single processor, which we will refer to generally as communication. Technological trends indicate that algorithmic performance will become even more limited by communication in the future. In this thesis, we consider the fundamental computations within dense linear algebra and address the following question: can we significantly improve the current algorithms for these computations, in terms of the communication they require and their performance in practice?

To answer the question, we analyze algorithms on sequential and parallel architectural models that are simple enough to determine coarse communication costs but accurate enough to predict performance of implementations on real hardware. For most of the computations, we prove lower bounds on the communication that any algorithm must perform. If an algorithm exists with communication costs that match the lower bounds (at least in an asymptotic sense), we call the algorithm communication optimal. In many cases, the most commonly used algorithms are not communication optimal, and we can develop new algorithms that require less data movement and attain the communication lower bounds. In this thesis, we develop both new communication lower bounds and new algorithms, tightening (and in many cases closing) the gap between best known lower bound and best known algorithm (or upper bound). We consider both sequential and parallel algorithms, and we assess both classical and fast algorithms (e.g., Strassen's matrix multiplication algorithm).

In particular, the central contributions of this thesis are: proving new communication lower bounds for nearly all classical direct linear algebra computations (dense or sparse), including factorizations for solving linear systems, least squares problems, and eigenvalue and singular value problems; proving new communication lower bounds for Strassen's and other fast matrix multiplication algorithms; proving new parallel communication lower bounds for classical and fast computations that set limits on an algorithm's ability to perfectly strong scale; summarizing the state-of-the-art in communication efficiency for both sequential and parallel algorithms for the computations to which the lower bounds apply; developing a new communication-optimal algorithm for computing a symmetric-indefinite factorization (observing speedups of up to 2.8x compared to alternative shared-memory parallel algorithms); developing new, more communication-efficient algorithms for reducing a symmetric band matrix to tridiagonal form via orthogonal similarity transformations (observing speedups of 2--6x compared to alternative sequential and parallel algorithms); and developing a new communication-optimal parallelization of Strassen's matrix multiplication algorithm (observing speedups of up to 2.84x compared to alternative distributed-memory parallel algorithms).
BibTeX:
@phdthesis{Ballard13,
  author = {Ballard, Grey},
  title = {Avoiding Communication in Dense Linear Algebra},
  month = {Aug},
  year = {2013},
  number = {UCB/EECS-2013-151},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-151.html}
}

BibTeX:
@techreport{BBDD+13-TR,
  author = {Ballard, G. and Becker, D. and Demmel, J. and Dongarra, J. and Druinsky, A. and Peled, I. and Schwartz, O. and Toledo, S. and Yamazaki, I.},
  title = {Communication-Avoiding Symmetric-Indefinite Factorization},
  month = {July},
  year = {2013},
  number = {UCB/EECS-2013-127},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-127.html}
}

Awarded the Best Paper prize in the Algorithms track.

Abstract:
Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen's algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
BibTeX:
@inproceedings{BBDD+13,
  author = {Ballard, G. and Becker, D. and Demmel, J. and Dongarra, J. and Druinsky, A. and Peled, I. and Schwartz, O. and Toledo, S. and Yamazaki, I.},
  title = {Implementing a Blocked {Aasen's} Algorithm with a Dynamic Scheduler on Multicore Architectures},
  booktitle = {Proceedings of the 27th IEEE International Parallel Distributed Processing Symposium},
  month = {May},
  year = {2013},
  series = {IPDPS '13},
  pages = {895--907},
  url = {http://dx.doi.org/10.1109/IPDPS.2013.98}
}

Abstract:
Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdos-Renyi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
BibTeX:
@inproceedings{BBDG+13,
  author = {Ballard, G. and Bulu\c{c}, A. and Demmel, J. and Grigori, L. and Lipshitz, B. and Schwartz, O. and Toledo, S.},
  title = {Communication optimal parallel multiplication of sparse random matrices},
  booktitle = {Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {Montr\'eal, Qu\'ebec, Canada},
  month = {June},
  year = {2013},
  series = {SPAA '13},
  pages = {222--231},
  url = {http://doi.acm.org/10.1145/2486159.2486196}
}

Abstract:
The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system.
BibTeX:
@techreport{BDGJ+13-TR,
  author = {Ballard, G. and Demmel, J. and Grigori, L. and Jacquelin, M. and Nguyen, H.D. and Solomonik, E.},
  title = {Reconstructing {H}ouseholder Vectors from Tall-Skinny {QR}},
  month = {Oct},
  year = {2013},
  number = {UCB/EECS-2013-175},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-175.html}
}

BibTeX:
@techreport{BDK13-TR,
  author = {Ballard, G. and Demmel, J. and Knight, N.},
  title = {Avoiding Communication in Successive Band Reduction},
  month = {July},
  year = {2013},
  number = {UCB/EECS-2013-131},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-131.html}
}

Abstract:
High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.
BibTeX:
@inproceedings{BDLST13,
  author = {Ballard, G. and Demmel, J. and Lipshitz, B. and Schwartz, O. and Toledo, S.},
  title = {Communication efficient {Gaussian} elimination with partial pivoting using a shape morphing data layout},
  booktitle = {Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {Montr\'eal, Qu\'ebec, Canada},
  month = {June},
  year = {2013},
  series = {SPAA '13},
  pages = {232--240},
  url = {http://doi.acm.org/10.1145/2486159.2486198}
}

Abstract:
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
BibTeX:
@inproceedings{BDHLS12-SS,
  author = {G. Ballard and J. Demmel and O. Holtz and B. Lipshitz and O. Schwartz},
  title = {Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds},
  booktitle = {{Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures}},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {June},
  year = {2012},
  series = {SPAA '12},
  pages = {77--79},
  url = {http://doi.acm.org/10.1145/2312005.2312021}
}
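
As a reminder of the shape of the bounds referenced above, the display below states them from memory of the standard results, with constants omitted, P the number of processors, M the local memory size, and ω_0 = log_2 7 for Strassen-based algorithms; the precise statements are in the paper.

LaTeX sketch:

\[
\begin{aligned}
W_{\text{classical}} &= \Omega\!\Big(\frac{n^3}{P\sqrt{M}}\Big) \ \text{(memory-dependent)},
& W_{\text{classical}} &= \Omega\!\Big(\frac{n^2}{P^{2/3}}\Big) \ \text{(memory-independent)}, \\
W_{\text{Strassen}} &= \Omega\!\Big(\frac{n^{\omega_0}}{P\,M^{\omega_0/2-1}}\Big) \ \text{(memory-dependent)},
& W_{\text{Strassen}} &= \Omega\!\Big(\frac{n^2}{P^{2/\omega_0}}\Big) \ \text{(memory-independent)}.
\end{aligned}
\]

The two bounds in each row coincide roughly when P reaches n^3/M^(3/2) (classical) or n^(ω_0)/M^(ω_0/2) (Strassen-based), which is why perfect strong scaling cannot extend past those processor counts; this is a paraphrase of the abstract's claim, not a quotation of the paper's theorem statements.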

Abstract:
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
BibTeX:
@inproceedings{BDHLS12-CAPS,
  author = {G. Ballard and J. Demmel and O. Holtz and B. Lipshitz and O. Schwartz},
  title = {Communication-optimal parallel algorithm for {S}trassen's matrix multiplication},
  booktitle = {{Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures}},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {June},
  year = {2012},
  series = {SPAA '12},
  pages = {193--204},
  url = {http://doi.acm.org/10.1145/2312005.2312044}
}

Abstract:
Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example, to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal.
BibTeX:
@incollection{BDHLS12-RECT,
  author = {Ballard, G. and Demmel, J. and Holtz, O. and Lipshitz, B. and Schwartz, O.},
  title = {Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication},
  booktitle = {Design and Analysis of Algorithms},
  editor = {Even, G. and Rawitz, D.},
  publisher = {Springer Berlin Heidelberg},
  month = {December},
  year = {2012},
  series = {Lecture Notes in Computer Science},
  volume = {7659},
  pages = {13--36},
  url = {http://dx.doi.org/10.1007/978-3-642-34862-4_2}
}

BibTeX:
@techreport{BDHLS12-SS-TR,
  author = {Ballard, G. and Demmel, J. and Holtz, O. and Lipshitz, B. and Schwartz, O.},
  title = {Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds},
  month = {March},
  year = {2012},
  number = {UCB/EECS-2012-31},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-31.html}
}

Abstract:
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size M, too small to store three n-by-n matrices, the lower bound on the number of words moved between fast and slow memory is, for a large class of matrix multiplication algorithms, Ω((n/√M)^ω_0 ⋅ M), where ω_0 is the exponent in the arithmetic count (e.g., ω_0 = lg 7 for Strassen, and ω_0 = 3 for conventional matrix multiplication). With p parallel processors, each with fast memory of size M, the lower bound is asymptotically lower by a factor of p. These bounds are attainable both for sequential and for parallel algorithms and hence optimal.
BibTeX:
@article{BDHS12,
  author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz},
  title = {Graph expansion and communication costs of fast matrix multiplication},
  journal = {Journal of the ACM},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {December},
  year = {2012},
  volume = {59},
  number = {6},
  pages = {32:1--32:23},
  url = {http://doi.acm.org/10.1145/2395116.2395121}
}
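
As a hedged restatement of the sequential bound quoted in the abstract (with Ω absorbing constant factors), the lower bound on words moved can be rewritten and specialized as

\[
  W = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} M\right)
    = \Omega\!\left(\frac{n^{\omega_0}}{M^{\omega_0/2-1}}\right),
\]

so ω_0 = 3 recovers the classical Hong-Kung bound Ω(n^3/√M), while ω_0 = lg 7 ≈ 2.81 gives Ω(n^{lg 7}/M^{lg 7/2 - 1}) ≈ Ω(n^{2.81}/M^{0.40}) for Strassen's algorithm.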

BibTeX:
@techreport{BDHS12-FastLA,
  author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz},
  title = {Sequential Communication Bounds for Fast Linear Algebra},
  month = {March},
  year = {2012},
  number = {UCB/EECS-2012-36},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-36.html}
}

Abstract:
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while doing 43% more floating point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library, comparing parallel performance on a ten-core processor.
BibTeX:
@inproceedings{BDK12,
  author = {G. Ballard and J. Demmel and N. Knight},
  title = {Communication avoiding successive band reduction},
  booktitle = {Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {February},
  year = {2012},
  series = {PPoPP '12},
  pages = {35--44},
  url = {http://doi.acm.org/10.1145/2145816.2145822}
}
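
For context, band-to-tridiagonal reduction is the kernel inside LAPACK-style band eigensolvers. The SciPy sketch below (dimension and half-bandwidth chosen arbitrarily) only shows the surrounding computation: scipy.linalg.eig_banded reduces the band matrix to tridiagonal form internally before solving, which is the analogue of the step this paper reorganizes; it is not the communication-avoiding algorithm itself.

import numpy as np
from scipy.linalg import eig_banded

# Symmetric band matrix in LAPACK "upper" banded storage:
# a_band[b + i - j, j] = A[i, j] for j - b <= i <= j.
n, b = 2000, 8                      # arbitrary dimension and half-bandwidth
rng = np.random.default_rng(0)
a_band = rng.standard_normal((b + 1, n))

# Eigenvalues only; the band-to-tridiagonal reduction happens inside LAPACK.
w = eig_banded(a_band, lower=False, eigvals_only=True)
print(w[:3], w[-3:])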

Abstract:
Matrix multiplication is a fundamental kernel of many high performance and scientific computing applications. Most parallel implementations use classical O(n^3) matrix multiplication, even though there exist algorithms with lower arithmetic complexity. We recently presented a new Communication-Avoiding Parallel Strassen algorithm (CAPS), based on Strassen's fast matrix multiplication, that minimizes communication (SPAA '12). It communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains theoretical lower bounds. In this paper we show that CAPS is also faster in practice. We benchmark and compare its performance to previous algorithms on Hopper (Cray XE6), Intrepid (IBM BG/P), and Franklin (Cray XT4). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of CAPS and predict its performance on future exascale platforms.
BibTeX:
@inproceedings{LBDS12,
  author = {Lipshitz, B. and Ballard, G. and Demmel, J. and Schwartz, O.},
  title = {Communication-avoiding parallel {S}trassen: {I}mplementation and performance},
  booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
  address = {Salt Lake City, Utah},
  month = {November},
  year = {2012},
  series = {SC '12},
  pages = {101:1--101:11},
  url = {http://dl.acm.org/citation.cfm?id=2388996.2389133}
}

Abstract:
We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 17x for tall-skinny matrices and Intel's Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel's Math Kernel Library (MKL) singular value decomposition routine on a multicore CPU.
BibTeX:
@inproceedings{ABDK11,
  author = {Anderson, M. and Ballard, G. and Demmel, J. and Keutzer, K.},
  title = {Communication-Avoiding {QR} Decomposition for {GPU}s},
  booktitle = {Proceedings of the 2011 IEEE International Parallel \& Distributed Processing Symposium},
  publisher = {IEEE Computer Society},
  address = {Washington, DC, USA},
  month = {May},
  year = {2011},
  series = {{IPDPS} '11},
  pages = {48--58},
  url = {http://dx.doi.org/10.1109/IPDPS.2011.15}
}
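
The panel kernel underlying CAQR is TSQR, which factors a tall-skinny matrix by independent QRs of row blocks followed by one small reduction QR. The NumPy sketch below (block size, test shapes, and returning only R are arbitrary simplifications) illustrates that structure; it is not the GPU implementation described above.

import numpy as np

def tsqr_r(A, block_rows=1024):
    # Two-level TSQR: factor row blocks independently, then combine the
    # stacked R factors with one more QR. Returns only the R factor.
    m, n = A.shape
    Rs = [np.linalg.qr(A[i:i + block_rows], mode='r')
          for i in range(0, m, block_rows)]
    return np.linalg.qr(np.vstack(Rs), mode='r')

# Tall-skinny test: R reproduces the Gram matrix, A^T A = R^T R.
A = np.random.default_rng(1).standard_normal((100_000, 32))
R = tsqr_r(A)
print(np.allclose(R.T @ R, A.T @ A))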

BibTeX:
@techreport{BDD11,
  author = {Ballard, G. and Demmel, J. and Dumitriu, I.},
  title = {Communication-optimal parallel and sequential eigenvalue and singular value algorithms},
  month = {February},
  year = {2011},
  number = {EECS-2011-14},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-14.html}
}

Abstract:
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, the importance of pursuing algorithms which minimize communication also increases. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present an algorithm for dense matrix multiplication which attains the lower bound.
BibTeX:
@inproceedings{BDG11,
  author = {Ballard, G. and Demmel, J. and Gearhart, A.},
  title = {Brief announcement: communication bounds for heterogeneous architectures},
  booktitle = {Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures},
  publisher = {ACM},
  address = {San Jose, California, USA},
  month = {June},
  year = {2011},
  series = {SPAA '11},
  pages = {257--258},
  url = {http://doi.acm.org/10.1145/1989493.1989531}
}

Awarded the Best Paper prize.

Abstract:
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal.
BibTeX:
@inproceedings{BDHS11-SLB,
  author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz},
  title = {Graph expansion and communication costs of fast matrix multiplication},
  booktitle = {{Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures}},
  publisher = {ACM},
  address = {San Jose, California, USA},
  month = {June},
  year = {2011},
  series = {SPAA '11},
  pages = {1--12},
  url = {http://doi.acm.org/10.1145/1989493.1989495}
}

BibTeX:
@techreport{BDHS11-TR,
  author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz},
  title = {Minimizing communication in linear algebra},
  month = {February},
  year = {2011},
  number = {EECS-2011-15},
  url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-15.html}
}

Awarded the 2009-2011 SIAM Linear Algebra Prize.

Abstract:
In 1981 Hong and Kung proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo, and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic_operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram-Schmidt algorithm, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth cost), we get lower bounds on the number of messages required to move it (latency cost). We extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix) to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms that attain many of these lower bounds.
BibTeX:
@article{BDHS11-GLB,
  author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz},
  title = {Minimizing Communication in Numerical Linear Algebra},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  publisher = {SIAM},
  month = {September},
  year = {2011},
  volume = {32},
  number = {3},
  pages = {866--901},
  url = {http://epubs.siam.org/doi/abs/10.1137/090769156}
}
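
Read concretely (with Ω absorbing constants), the bounds say that an algorithm in this class performing F arithmetic operations with a fast (or local) memory of size M must move

\[
  W = \Omega\!\left(\frac{F}{\sqrt{M}}\right) \ \text{words} \qquad \text{and send} \qquad
  S = \Omega\!\left(\frac{F}{M^{3/2}}\right) \ \text{messages};
\]

for example, Cholesky factorization with F ≈ n^3/3 flops inherits the bounds W = Ω(n^3/√M) and S = Ω(n^3/M^{3/2}). The symbol F is introduced here only for this illustration.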

Abstract:
The tensor eigenproblem has many important applications, generating both mathematical and application-specific interest in the properties of tensor eigenpairs and methods for computing them. A tensor is an m-way array, generalizing the concept of a matrix (a 2-way array). Kolda and Mayo have recently introduced a generalization of the matrix power method for computing real-valued tensor eigenpairs of symmetric tensors. In this work, we present an efficient implementation of their algorithm, exploiting symmetry in order to save storage, data movement, and computation. For an application involving repeatedly solving the tensor eigenproblem for many small tensors, we describe how a GPU can be used to accelerate the computations. On an NVIDIA Tesla C2050 (Fermi) GPU, we achieve 318 Gflops/s (31% of theoretical peak performance in single precision) on our test data set.
BibTeX:
@inproceedings{BKP11,
  author = {G. Ballard and T. Kolda and T. Plantenga},
  title = {Efficiently Computing Tensor Eigenvalues on a {GPU}},
  booktitle = {2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = {May},
  year = {2011},
  pages = {1340--1348},
  url = {https://ieeexplore.ieee.org/document/6008988}
}
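
The Kolda-Mayo iteration implemented in this paper is the shifted symmetric higher-order power method. The plain NumPy sketch below shows the update for a symmetric 3-way tensor; the shift, iteration count, and random test tensor are arbitrary, and none of the symmetry-exploiting storage or GPU batching described in the abstract is attempted.

import numpy as np

def ss_hopm(T, x0, alpha=5.0, iters=500):
    # Shifted symmetric higher-order power method (sketch): repeatedly
    # apply x <- normalize(T x^2 + alpha x); alpha must be large enough
    # for convergence and is chosen here without any tuning.
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = np.einsum('ijk,j,k->i', T, x, x)    # T x^2
        x = y + alpha * x
        x /= np.linalg.norm(x)
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)  # Rayleigh quotient T x^3
    return lam, x

# Symmetrize a random 3-way tensor and compute one real eigenpair.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4, 4))
T = sum(np.transpose(A, p) for p in
        [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]) / 6
lam, x = ss_hopm(T, rng.standard_normal(4))
print(lam, np.linalg.norm(np.einsum('ijk,j,k->i', T, x, x) - lam * x))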

Abstract:
Motivated by a problem which arises in the analysis of stagnation point flow toward a stretching sheet, we consider a general class of problems of the form y''' = f(y, y', y''), y(0) = a, y'(0) = m, lim_{t→∞} y'(t) = b. We give conditions on f which imply existence of at least one solution and obtain a partial uniqueness result.
BibTeX:
@article{BB11,
  author = {Baxley, J. and Ballard, G.},
  title = {Existence of Solutions for a Class of Singular Nonlinear Third Order Autonomous Boundary Value Problems},
  journal = {Communications in Applied Analysis},
  month = {Oct},
  year = {2011},
  volume = {15},
  number = {2, 3, and 4},
  pages = {195--202},
  url = {http://www.dynamicpublishers.com/CAA/caa2011.htm}
}

Abstract:
Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM. J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
BibTeX:
@article{BDHS10,
  author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz},
  title = {Communication-optimal Parallel and Sequential {C}holesky Decomposition},
  journal = {SIAM Journal on Scientific Computing},
  publisher = {SIAM},
  month = {December},
  year = {2010},
  volume = {32},
  number = {6},
  pages = {3495--3523},
  url = {http://epubs.siam.org/doi/abs/10.1137/090760969}
}
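
For reference, the blocked right-looking algorithm analyzed here is short to state. The NumPy/SciPy sketch below is only an illustration of that organization; the block size plays the role of the tuning parameter that communication-optimal variants choose relative to √M (the value 64 and the test problem are arbitrary).

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, b=64):
    # Right-looking blocked Cholesky (lower-triangular factor).
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n, b):
        e = min(j + b, n)
        # Factor the diagonal block.
        A[j:e, j:e] = cholesky(A[j:e, j:e], lower=True)
        if e < n:
            # Panel: L21 = A21 * L11^{-T} via a triangular solve.
            A[e:, j:e] = solve_triangular(A[j:e, j:e], A[e:, j:e].T,
                                          lower=True).T
            # Symmetric update of the trailing matrix.
            A[e:, e:] -= A[e:, j:e] @ A[e:, j:e].T
    return np.tril(A)

n = 500
B = np.random.default_rng(0).standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # symmetric positive definite test matrix
L = blocked_cholesky(A)
print(np.allclose(L @ L.T, A))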

BibTeX:
@techreport{BKP10-TR,
  author = {Ballard, G. and Kolda, T. and Plantenga, T.},
  title = {Efficiently Computing Tensor Eigenvalues on a {GPU}},
  month = {December},
  year = {2010},
  url = {http://csri.sandia.gov/Proceedings/}
}

Abstract:
We study the Friedrichs extension for a class of 2nth order ordinary differential operators. These self-adjoint operators have compact inverses and the central problem is to describe their domains in terms of boundary conditions.
BibTeX:
@article{BB09,
  author = {Ballard, G. and Baxley, J.},
  title = {The Friedrichs extension of certain singular differential operators, II},
  journal = {Electronic Journal of Qualitative Theory of Differential Equations},
  month = {Oct},
  year = {2009},
  volume = {Special Edition I},
  number = {5},
  pages = {1--11},
  url = {http://www.emis.ams.org/journals/EJQTDE/sped1/105.html}
}

Abstract:
Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky implementations to this lower bound, and draw the following conclusions: (1) "Naive" sequential algorithms for Cholesky attain neither the bandwidth nor latency lower bounds. (2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, AGW01, ST04], and one based on work of Toledo [Tol97], can attain the bandwidth lower bound. (3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot. (4) The recursive sequential algorithm due to [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cache-oblivious" way. (5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block size) attains both the bandwidth and latency lower bounds to within a poly-logarithmic factor. Combined with prior results in [DGHL08a, DGHL08b, DGX08] this gives a complete set of communication-optimal algorithms for O(n^3) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU and QR by optimizing communication for any number of levels of memory hierarchy.
BibTeX:
@inproceedings{BDHS09,
  author = {Ballard, G. and Demmel, J. and Holtz, O. and Schwartz, O.},
  title = {Communication-optimal parallel and sequential {C}holesky decomposition},
  booktitle = {{Proceedings of the 22nd Symposium on Parallelism in Algorithms and Architectures}},
  publisher = {ACM},
  address = {Calgary, AB, Canada},
  month = {June},
  year = {2009},
  series = {SPAA '09},
  pages = {245--252},
  url = {http://doi.acm.org/10.1145/1583991.1584054}
}

BibTeX:
@mastersthesis{Ballard08,
  author = {Ballard, G.},
  title = {Asymptotic behavior of the eigenvalues of Toeplitz integral operators associated with the Hankel transform},
  month = {May},
  year = {2008},
  url = {http://wakespace.lib.wfu.edu/handle/10339/14676}
}

Abstract:
We consider boundary value problems of the form y''=-f(t,y), y(0)=0, y(1)=0, motivated by examples where f(t,y) = φ(t)g(y) and g(y) behaves like y^-λ (λ > 0) as y → 0^+. We explore conditions under which such problems have multiple positive solutions, investigate qualitative behavior of these solutions, and discuss computational methods for approximating the solutions.
BibTeX:
@article{BB08,
  author = {Ballard, G. and Baxley, J.},
  title = {Qualitative behavior and computation of multiple solutions of singular nonlinear boundary value problems},
  journal = {Involve},
  month = {Feb},
  year = {2008},
  volume = {1},
  number = {1},
  pages = {21--31},
  url = {http://msp.org/involve/2008/1-1/p03.xhtml}
}

Abstract:
In an interdisciplinary effort to model protein dependency networks, biologists measure signals from certain proteins within cells over a given interval of time. Using this time series data, the goal is to deduce protein dependency relationships. The mathematical challenge is to statistically measure correlations between given proteins over time in order to conjecture probable relationships. Biologists can then consider these relationships with more scrutiny, in order to confirm their conjectures. One algorithm for finding such relationships makes use of interpolation of the data to produce next-state functions for each protein and the Deegan-Packel Index of Power voting method to measure the strength of correlations between pairs of proteins. The algorithm was previously implemented, but limitations associated with the original language required the algorithm to be re-implemented in a more computationally efficient language. Because of the algebraic focus of the Computational Commutative Algebra language, or CoCoA, the algorithm was re-implemented in this language, and results have been produced much more efficiently. In this paper I discuss the algorithm, the CoCoA language, the implementation of the algorithm in CoCoA, and the quality of the results.
BibTeX:
@article{Ballard06,
  author = {Ballard, G.},
  title = {Modeling Protein Dependency Networks Using CoCoA},
  journal = {Crossroads},
  publisher = {ACM},
  address = {New York, NY, USA},
  month = {September},
  year = {2006},
  volume = {13},
  number = {1},
  pages = {6--6},
  url = {http://doi.acm.org/10.1145/1217666.1217672}
}
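
The Deegan-Packel index mentioned above has a simple definition: each minimal winning coalition counts equally, and its power is split equally among its members. The Python sketch below computes the index for a generic weighted voting game; the example game is made up for illustration and is not the paper's protein-network adaptation.

from itertools import combinations

def deegan_packel(weights, quota):
    # Deegan-Packel power index for the weighted voting game
    # [quota; w_1, ..., w_n].
    n = len(weights)
    players = range(n)
    def wins(S):
        return sum(weights[i] for i in S) >= quota
    # Minimal winning coalitions: winning, but losing if any member leaves.
    minimal = [S for r in range(1, n + 1)
               for S in combinations(players, r)
               if wins(S) and all(not wins(tuple(j for j in S if j != i))
                                  for i in S)]
    index = [0.0] * n
    for S in minimal:
        for i in S:
            index[i] += 1.0 / len(S)
    return [v / len(minimal) for v in index]

# Toy game [5; 3, 2, 2, 1]; the indices sum to 1 by construction.
print(deegan_packel([3, 2, 2, 1], 5))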

Abstract:
We consider nonlinear boundary value problems with multiple solutions. A method is proposed for the computation of such solutions which depends crucially on known a priori qualitative information about the behavior of the solutions. The method is a two-stage method where the second stage is a shooting method and initial values of the shooting parameters are found in a first stage which approximates the boundary value problem with a discrete approximation. Both nonsingular and singular problems are considered.
BibTeX:
@article{BBN06,
  author = {Ballard, G. and Baxley, J. and Libbus, N.},
  title = {Qualitative behavior and computation of multiple solutions of nonlinear boundary value problems},
  journal = {Communications on Pure and Applied Analysis},
  month = {Mar},
  year = {2006},
  volume = {5},
  number = {2},
  pages = {251--259},
  url = {http://www.aimsciences.org/journals/displayArticles.jsp?paperID=1769}
}
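
To illustrate the shooting stage on a problem with multiple solutions, the sketch below uses the classical Bratu problem y'' = -λ e^y, y(0) = y(1) = 0, a stand-in chosen here (a regular nonlinearity, not one of the singular examples treated in these papers) because it has two solutions for small λ. The brackets for the unknown initial slope are found by inspection rather than by the paper's first-stage discrete approximation.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

lam = 1.0   # illustrative parameter; two solutions exist for small lam

def y_at_1(s):
    # Integrate the IVP y'' = -lam*exp(y), y(0) = 0, y'(0) = s, and
    # return the boundary mismatch y(1; s).
    sol = solve_ivp(lambda t, u: [u[1], -lam * np.exp(u[0])],
                    (0.0, 1.0), [0.0, s], rtol=1e-10, atol=1e-12)
    return sol.y[0, -1]

# Two sign changes of y(1; s) bracket the two shooting parameters.
s_low = brentq(y_at_1, 0.0, 2.0)
s_high = brentq(y_at_1, 2.0, 20.0)
print(s_low, s_high)   # roughly 0.55 and 10.9 for lam = 1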
