G. Ballard,
| |||||

BibTeX:
@article{Ballard19, author = {Grey Ballard}, title = {CP and Tucker Tensor Decompositions}, journal = {SIAG/OPT Views and News}, month = {January}, year = {2019}, volume = {27}, number = {1}, pages = {1--7}, url = {http://wiki.siam.org/siag-op/images/siag-op/b/bd/ViewsAndNews-27-1.pdf} } |
|||||

G. Ballard, A. Klinvex and T. G. Kolda,
| |||||

Abstract: Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, which we contrast to most compression methods that are focused on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 terabyte and 6.7 terabyte data sets distributed across 100s of nodes (1000s of MPI processes), achieving compression rates between 100-200,000x which equates to 99-99.999% compression (depending on the desired accuracy) in substantially less time than it would take to even read the same dataset from a parallel filesystem. Moreover, we show that our method also allows for reconstruction of partial or down-sampled data on a single node, without a parallel computer so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single down-sampled time step or computing summary statistics. | |||||

BibTeX:
@techreport{BKK19, author = {Grey Ballard and Alicia Klinvex and Tamara G. Kolda}, title = {{TuckerMPI}: Efficient parallel software for {Tucker} decompositions of dense tensors}, year = {2019}, number = {1901.06043}, url = {https://arxiv.org/abs/1901.06043} } |
|||||

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin and N. Knight,
technical report}
| |||||

Abstract: Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs. | |||||

BibTeX:
@inproceedings{BDGJK18, author = {Grey Ballard and James Demmel and Laura Grigori and Mathias Jacquelin and Nicholas Knight}, title = {A {3D} Parallel Algorithm for {QR} Decomposition}, booktitle = {Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures}, publisher = {ACM}, address = {New York, NY, USA}, year = {2018}, series = {SPAA '18}, pages = {55--65}, url = {http://doi.acm.org/10.1145/3210377.3210415} } |
|||||

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin and N. Knight,
published version}
| |||||

Abstract: Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs. | |||||

BibTeX:
@techreport{BDGJK18-TR, author = {Grey Ballard and James Demmel and Laura Grigori and Mathias Jacquelin and Nicholas Knight}, title = {A 3D Parallel Algorithm for QR Decomposition}, year = {2018}, number = {1805.05278}, url = {https://arxiv.org/abs/1805.05278} } |
|||||

G. Ballard, K. Hayashi and R. Kannan,
| |||||

BibTeX:
@techreport{BHK18-TR, author = {Grey Ballard and Koby Hayashi and Ramakrishnan Kannan}, title = {Parallel Nonnegative {CP} Decomposition of Dense Tensors}, year = {2018}, number = {1806.07985}, note = {To appear in the 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 18).}, url = {https://arxiv.org/abs/1806.07985} } |
|||||

G. Ballard, C. Ikenmeyer, J. Landsberg and N. Ryder,
published version}
| |||||

Abstract: This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank 23 decompositions for the 3 × 3 matrix multiplication tensor M_⟨ 3 ⟩. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and admits an unexpected symmetry group of order 12. We establish basic information regarding symmetry groups of decompositions and outline two approaches for finding new rank decompositions of M_⟨ n ⟩ for larger n. | |||||

BibTeX:
@techreport{BILR18-TR, author = {Grey Ballard and Christian Ikenmeyer and J.M. Landsberg and Nick Ryder}, title = {The geometry of rank decompositions of matrix multiplication {II}: 3x3 matrices}, year = {2018}, number = {1801.00843}, url = {https://arxiv.org/abs/1801.00843} } |
|||||

G. Ballard, C. Ikenmeyer, J. Landsberg and N. Ryder,
technical report}
| |||||

Abstract: This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank 23 decompositions for the 3 matrix multiplication tensor M_⟨ 3 ⟩. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and admits an unexpected symmetry group of order 12. We establish basic information regarding symmetry groups of decompositions and outline two approaches for finding new rank decompositions of M_⟨ n ⟩ for larger n. | |||||

BibTeX:
@article{BILR18, author = {Grey Ballard and Christian Ikenmeyer and J.M. Landsberg and Nick Ryder}, title = {The geometry of rank decompositions of matrix multiplication II: 3x3 matrices}, journal = {Journal of Pure and Applied Algebra}, year = {2018}, url = {http://www.sciencedirect.com/science/article/pii/S0022404918302548} } |
|||||

G. Ballard, N. Knight and K. Rouse,
technical report}
| |||||

Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation. | |||||

BibTeX:
@inproceedings{BKR18, author = {Grey Ballard and Nicholas Knight and Kathryn Rouse}, title = {Communication Lower Bounds for Matricized Tensor Times {Khatri-Rao} Product}, booktitle = {Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium}, month = {May}, year = {2018}, pages = {557-567}, url = {https://ieeexplore.ieee.org/document/8425209/} } |
|||||

C. Battaglino, G. Ballard and T. G. Kolda,
technical report}
| |||||

Abstract: The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show that the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri--Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization. | |||||

BibTeX:
@article{BBK18, author = {Casey Battaglino and Grey Ballard and Tamara G. Kolda}, title = {A Practical Randomized CP Tensor Decomposition}, journal = {SIAM Journal on Matrix Analysis and Applications}, year = {2018}, volume = {39}, number = {2}, pages = {876-901}, url = {https://doi.org/10.1137/17M1112303} } |
|||||

K. Hayashi, G. Ballard, Y. Jiang and M. J. Tobia,
technical report}
| |||||

Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this work, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4X over existing parallel software. | |||||

BibTeX:
@inproceedings{HBJT18, author = {Hayashi, Koby and Ballard, Grey and Jiang, Yujie and Tobia, Michael J.}, title = {Extended Abstract: Shared-memory Parallelization of MTTKRP for Dense Tensors}, booktitle = {Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, address = {New York, NY, USA}, year = {2018}, series = {PPoPP '18}, pages = {393--394}, url = {http://doi.acm.org/10.1145/3178487.3178522} } |
|||||

R. Kannan, G. Ballard and H. Park,
| |||||

Abstract: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that WH approximates A. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for mathbfW and mathbfH . It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage- a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online. | |||||

BibTeX:
@article{KBP17, author = {Ramakrishnan Kannan and Grey Ballard and Haesun Park}, title = {{MPI-FAUN}: An {MPI}-Based Framework for Alternating-Updating Nonnegative Matrix Factorization}, journal = {IEEE Transactions on Knowledge and Data Engineering}, month = {March}, year = {2018}, volume = {30}, number = {3}, pages = {544-558}, url = {https://www.computer.org/csdl/trans/tk/2018/03/08089433-abs.html} } |
|||||

O. Kaya, R. Kannan and G. Ballard,
| |||||

Abstract: Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and unmixing in microscopy. In this paper, we focus on scaling algorithms for NMF to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. We consider two previous works developed for related problems,one that uses a fine-grained partitioning strategy using a point-to-point communication pattern and one that uses a Cartesian, or checkerboard, partitioning strategy using a collective-based communication pattern. We show that a combination of the previous approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. From the experiments, we see that our proposed strategy runs up to 10x faster than the state of the art on real-world datasets. | |||||

BibTeX:
@inproceedings{KKB18, author = {Oguz Kaya and Ramakrishnan Kannan and Grey Ballard}, title = {Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization}, booktitle = {Proceedings of the 47th International Conference on Parallel Processing}, publisher = {ACM}, address = {New York, NY, USA}, year = {2018}, series = {ICPP 2018}, pages = {90:1--90:10}, url = {https://dl.acm.org/citation.cfm?id=3225127} } |
|||||

F. Mokhtari, P. J. Laurienti, W. J. Rejeski and G. Ballard,
| |||||

Abstract: There is a growing interest in using so-called dynamic functional connectivity, as the conventional static brain connectivity models are being questioned. Brain network analyses yield complex network data that are difficult to analyze and interpret. To deal with the complex structures, decomposition/factorization techniques that simplify the data are often used. For dynamic network analyses, data simplification is of even greater importance, as dynamic connectivity analyses result in a time series of complex networks. A new challenge that must be faced when using these decomposition/factorization techniques is how to interpret the resulting connectivity patterns. Connectivity patterns resulting from decomposition analyses are often visualized as networks in brain space, in the same way that pairwise correlation networks are visualized. This elevates the risk of conflating connections between nodes that represent correlations between nodes' time series with connections between nodes that result from decomposition analyses. Moreover, dynamic connectivity data may be represented with three-dimensional or four-dimensional (4D) tensors and decomposition results require unique interpretations. Thus, the primary goal of this article is to (1) address the issues that must be considered when interpreting the connectivity patterns from decomposition techniques and (2) show how the data structure and decomposition method interact to affect this interpretation. The outcome of our analyses is summarized as follows. (1) The edge strength in decomposition connectivity patterns represents complex relationships not pairwise interactions between the nodes. (2) The structure of the data significantly alters the connectivity patterns, for example, 4D data result in connectivity patterns with higher regional connections. (3) Orthogonal decomposition methods outperform in feature reduction applications, whereas nonorthogonal decomposition methods are better for mechanistic interpretation. | |||||

BibTeX:
@article{MLRB18, author = {Mokhtari, Fatemeh and Laurienti, Paul J and Rejeski, W Jack and Ballard, Grey}, title = {Dynamic Functional Magnetic Resonance Imaging Connectivity Tensor Decomposition: A New Approach to Analyze and Interpret Dynamic Brain Connectivity}, journal = {Brain connectivity}, month = {December}, year = {2018}, url = {https://doi.org/10.1089/brain.2018.0605} } |
|||||

G. Ballard, N. Knight and K. Rouse,
published version}
| |||||

Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation. | |||||

BibTeX:
@techreport{BKR17-TR, author = {Grey Ballard and Nicholas Knight and Kathryn Rouse}, title = {Communication Lower Bounds for Matricized Tensor Times {Khatri-Rao} Product}, year = {2017}, number = {1708.07401}, url = {https://arxiv.org/abs/1708.07401} } |
|||||

C. Battaglino, G. Ballard and T. G. Kolda,
published version}
| |||||

Abstract: The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri-Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization. | |||||

BibTeX:
@techreport{BBK17-TR, author = {Casey Battaglino and Grey Ballard and Tamara G. Kolda}, title = {A Practical Randomized CP Tensor Decomposition}, year = {2017}, number = {1701.06600}, url = {https://arxiv.org/abs/1701.06600} } |
|||||

K. Hayashi, G. Ballard, J. Jiang and M. J. Tobia,
published version}
| |||||

Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4x over existing parallel software. | |||||

BibTeX:
@techreport{HBJT17-TR, author = {Koby Hayashi and Grey Ballard and Jeffrey Jiang and Michael J. Tobia}, title = {Shared Memory Parallelization of MTTKRP for Dense Tensors}, year = {2017}, number = {1708.08976}, url = {https://arxiv.org/abs/1708.08976} } |
|||||

E. Solomonik, G. Ballard, J. Demmel and T. Hoefler,
technical report}
| |||||

Abstract: Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires c) less interprocessor communication than previously known algorithms, for any c≤ p^1/3 when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O(log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices. | |||||

BibTeX:
@inproceedings{SBDH17, author = {Solomonik, Edgar and Ballard, Grey and Demmel, James and Hoefler, Torsten}, title = {A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem}, booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures}, publisher = {ACM}, address = {New York, NY, USA}, year = {2017}, series = {SPAA '17}, pages = {111--121}, url = {http://doi.acm.org/10.1145/3087556.3087561} } |
|||||

M. J. Tobia, K. Hayashi, G. Ballard, I. H. Gotlib and C. E. Waugh,
| |||||

Abstract: Exposure to acute stress induces multiple emotional responses, each with their own unique temporal dynamics. Dynamic functional connectivity (dFC) measures the temporal variability of network synchrony and captures individual differences in network neurodynamics. This study investigated the relationship between dFC and individual differences in emotions induced by an acute psychosocial stressor. Sixteen healthy adult women underwent fMRI scanning during a social evaluative threat (SET) task, and retrospectively completed questionnaires that assessed individual differences in subjectively experienced positive and negative emotions about stress and stress relief during the task. Group dFC was decomposed with parallel factor analysis (PARAFAC) into 10 components, each with a temporal signature, spatial network of functionally connected regions, and vector of participant loadings that captures individual differences in dFC. Participant loadings of two networks were positively correlated with stress-related emotions, indicating the existence of networks for positive and negative emotions. The emotion-related networks involved the ventromedial prefrontal cortex, cingulate cortex, anterior insula, and amygdala, among other distributed brain regions, and time signatures for these emotion-related networks were uncorrelated. These findings demonstrate that individual differences in stress-induced positive and negative emotions are each uniquely associated with large-scale brain networks, and suggest that dFC is a mechanism that generates individual differences in the emotional components of the stress response. | |||||

BibTeX:
@article{THBGW17, author = {Tobia, Michael J. and Hayashi, Koby and Ballard, Grey and Gotlib, Ian H. and Waugh, Christian E.}, title = {Dynamic functional connectivity and individual differences in emotions during social stress}, journal = {Human Brain Mapping}, year = {2017}, volume = {38}, number = {12}, pages = {6185--6205}, url = {http://dx.doi.org/10.1002/hbm.23821} } |
|||||

W. Austin, G. Ballard and T. G. Kolda,
technical report}
| |||||

Abstract: As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments. | |||||

BibTeX:
@inproceedings{ABK16, author = {Woody Austin and Grey Ballard and Tamara G. Kolda}, title = {Parallel Tensor Compression for Large-Scale Scientific Data}, booktitle = {Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium}, month = {May}, year = {2016}, pages = {912--922}, url = {https://www.computer.org/csdl/proceedings/ipdps/2016/2140/00/2140a912-abs.html} } |
|||||

A. Azad, G. Ballard, A. Buluç, J. Demmel, L. Grigori, O. Schwartz, S. Toledo and S. Williams,
technical report}
| |||||

Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos--Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research. | |||||

BibTeX:
@article{AB+16, author = {Ariful Azad and Grey Ballard and Aydin Bulu\c{c} and James Demmel and Laura Grigori and Oded Schwartz and Sivan Toledo and Samuel Williams}, title = {Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication}, journal = {SIAM Journal on Scientific Computing}, year = {2016}, volume = {38}, number = {6}, pages = {C624-C651}, url = {http://dx.doi.org/10.1137/15M104253X} } |
|||||

G. Ballard, A. R. Benson, A. Druinsky, B. Lipshitz and O. Schwartz,
technical report}
| |||||

Abstract: Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen's original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen's algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than the classical algorithm, they are not used even in cases where they provide a performance benefit. We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy. | |||||

BibTeX:
@article{BBDLS16, author = {Grey Ballard and Austin R. Benson and Alex Druinsky and Benjamin Lipshitz and Oded Schwartz}, title = {Improving the Numerical Stability of Fast Matrix Multiplication}, journal = {SIAM Journal on Matrix Analysis and Applications}, year = {2016}, volume = {37}, number = {4}, pages = {1382-1418}, url = {http://dx.doi.org/10.1137/15M1032168} } |
|||||

G. Ballard, J. Demmel, A. Gearhart, B. Lipshitz, Y. Oltchik, O. Schwartz and S. Toledo,
| |||||

Abstract: Network topologies can have significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of the network and the computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds which only specify the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks. On the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms only in very few cases, leaving many open problems for network and algorithmic design. | |||||

BibTeX:
@inproceedings{BD+16, author = {Ballard, Grey and Demmel, James and Gearhart, Andrew and Lipshitz, Benjamin and Oltchik, Yishai and Schwartz, Oded and Toledo, Sivan}, title = {Network Topologies and Inevitable Contention}, booktitle = {Proceedings of the First Workshop on Optimization of Communication in HPC}, publisher = {IEEE Press}, address = {Piscataway, NJ, USA}, year = {2016}, series = {COM-HPC '16}, pages = {39--52}, url = {http://dl.acm.org/citation.cfm?id=3018063} } |
|||||

G. Ballard, A. Druinsky, N. Knight and O. Schwartz,
technical report}
| |||||

Abstract: We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is nonzero structure dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our three-dimensional fine-grained model, we also propose coarse-grained one-dimensional and two-dimensional models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space. | |||||

BibTeX:
@article{BDKS16, author = {Ballard, Grey and Druinsky, Alex and Knight, Nicholas and Schwartz, Oded}, title = {Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication}, journal = {ACM Transactions on Parallel Computing}, publisher = {ACM}, address = {New York, NY, USA}, month = {December}, year = {2016}, volume = {3}, number = {3}, pages = {18:1--18:34}, url = {http://doi.acm.org/10.1145/3015144} } |
|||||

G. Ballard, A. Druinsky, N. Knight and O. Schwartz,
published version}
| |||||

Abstract: We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is sparsity dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our (3D) fine-grained model, we also propose coarse-grained 1D and 2D models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space. | |||||

BibTeX:
@techreport{BDKS16-TR, author = {Grey Ballard and Alex Druinsky and Nicholas Knight and Oded Schwartz}, title = {Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication}, month = {March}, year = {2016}, number = {1603.05627}, url = {http://arxiv.org/abs/1603.05627} } |
|||||

G. Ballard, C. Siefert and J. Hu,
technical report}
| |||||

Abstract: We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient one for all of the matrix-matrix multiplications involved. By using an alternative algorithm, we show that the communication costs are reduced (in theory and practice), and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that we can reduce communication by a factor of up to 5.4 for a model problem, and we observe in our empirical evaluation communication reductions of factors up to 4.7 for structured problems and 3.7 for unstructured problems. These reductions in communication translate to run-time speedups of factors up to 2.8 and 2.5, respectively. | |||||

BibTeX:
@article{BSH16, author = {Ballard, Grey and Siefert, Christopher and Hu, Jonathan}, title = {Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid}, journal = {SIAM Journal on Scientific Computing}, month = {June}, year = {2016}, volume = {38}, number = {3}, pages = {C203-C231}, url = {http://epubs.siam.org/doi/abs/10.1137/15M1028807} } |
|||||

R. Kannan, G. Ballard and H. Park,
technical report}
| |||||

Abstract: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that WH approximates A. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for both dense and sparse matrices, and (2) it allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors W and H within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. | |||||

BibTeX:
@inproceedings{KBP16, author = {Kannan, Ramakrishnan and Ballard, Grey and Park, Haesun}, title = {A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization}, booktitle = {Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, address = {New York, NY, USA}, month = {February}, year = {2016}, series = {PPoPP '16}, pages = {9:1--9:11}, url = {http://doi.acm.org/10.1145/2851141.2851152} } |
|||||

R. Kannan, G. Ballard and H. Park,
| |||||

Abstract: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A is approximated by WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans for few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online. | |||||

BibTeX:
@techreport{KBP16-TR, author = {Ramakrishnan Kannan and Grey Ballard and Haesun Park}, title = {MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization}, month = {September}, year = {2016}, number = {1609.09154}, url = {https://arxiv.org/abs/1609.09154} } |
|||||

E. Solomonik, G. Ballard, J. Demmel and T. Hoefler,
published version}
| |||||

Abstract: Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires c) less interprocessor communication than previously known algorithms, for any c≤ p^1/3 when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O(log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices. | |||||

BibTeX:
@techreport{SBDH16-TR, author = {Solomonik, Edgar and Ballard, Grey and Demmel, James and Hoefler, Torsten}, title = {A communication-avoiding parallel algorithm for the symmetric eigenvalue problem}, month = {April}, year = {2016}, number = {1604.03703}, url = {http://arxiv.org/abs/1604.03703} } |
|||||

W. Austin, G. Ballard and T. G. Kolda,
published version}
| |||||

Abstract: As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments. | |||||

BibTeX:
@techreport{ABK15-TR, author = {Woody Austin and Grey Ballard and Tamara G. Kolda}, title = {Parallel Tensor Compression for Large-Scale Scientific Data}, year = {2015}, number = {1510.06689}, url = {http://arxiv.org/abs/1510.06689} } |
|||||

A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori, O. Schwartz, S. Toledo and S. Williams,
published version}
| |||||

Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research. | |||||

BibTeX:
@techreport{AB+15-TR, author = {Ariful Azad and Grey Ballard and Aydin Buluc and James Demmel and Laura Grigori and Oded Schwartz and Sivan Toledo and Samuel Williams}, title = {Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication}, month = {October}, year = {2015}, number = {1510.00844}, url = {http://arxiv.org/abs/1510.00844} } |
|||||

G. Ballard, A. R. Benson, A. Druinsky, B. Lipshitz and O. Schwartz,
published version}
| |||||

Abstract: Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen's original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen's algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than the classical algorithm, they are not used even in cases where they provide a performance benefit. We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy. | |||||

BibTeX:
@techreport{BBDLS15-TR, author = {Grey Ballard and Austin R. Benson and Alex Druinsky and Benjamin Lipshitz and Oded Schwartz}, title = {Improving the numerical stability of fast matrix multiplication algorithms}, month = {July}, year = {2015}, number = {1507.00687}, url = {http://arxiv.org/abs/1507.00687} } |
|||||

G. Ballard, J. Demmel, L. Grigori, N. Knight, M. Jacquelin and H. D. Nguyen,
| |||||

Abstract: The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. Experiments on supercomputers demonstrate the benefits of the communication cost improvements: in particular, our experiments show substantial improvements over tuned library implementations for tall-and-skinny matrices. We also provide algorithmic improvements to the Householder QR and CAQR algorithms, and we investigate several alternatives to the Householder reconstruction algorithm that sacrifice guarantees on numerical stability in some cases in order to obtain higher performance. | |||||

BibTeX:
@article{BDGJ+15, author = {Ballard, Grey and Demmel, James and Grigori, Laura and Knight, Nicholas and Jacquelin, Mathias and Nguyen, Hong Diep}, title = {Reconstructing {H}ouseholder Vectors from Tall-Skinny {QR}}, journal = {Journal of Parallel and Distributed Computing}, month = {August}, year = {2015}, volume = {85}, pages = {3-31}, url = {http://www.sciencedirect.com/science/article/pii/S074373151500101X} } |
|||||

G. Ballard, J. Demmel and N. Knight,
technical report}
| |||||

Abstract: The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present sequential and distributed-memory parallel algorithms for tridiagonalizing full symmetric and symmetric band matrices that asymptotically reduce communication compared to previous approaches. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve structure, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality and high parallel latency cost. We improve both by reorganizing the computation and obtain asymptotic improvements. We also propose new algorithms for reducing a full symmetric matrix to band form in a communication-efficient manner. In this article, we consider the cases of computing eigenvalues only and of computing eigenvalues and all eigenvectors. | |||||

BibTeX:
@article{BDK15, author = {Ballard, Grey and Demmel, James and Knight, Nicholas}, title = {Avoiding Communication in Successive Band Reduction}, journal = {ACM Transactions on Parallel Computing}, publisher = {ACM}, address = {New York, NY, USA}, month = {February}, year = {2015}, volume = {1}, number = {2}, pages = {11:1--11:37}, url = {http://doi.acm.org/10.1145/2686877} } |
|||||

G. Ballard, A. Druinsky, N. Knight and O. Schwartz,
| |||||

Abstract: The performance of parallel algorithms for sparse matrix-matrix multiplication is typically determined by the amount of interprocessor communication performed, which in turn depends on the nonzero structure of the input matrices. In this paper, we characterize the communication cost of a sparse matrix-matrix multiplication algorithm in terms of the size of a cut of an associated hypergraph that encodes the computation for a given input nonzero structure. Obtaining an optimal algorithm corresponds to solving a hypergraph partitioning problem. Our hypergraph model generalizes several existing models for sparse matrix-vector multiplication, and we can leverage hypergraph partitioners developed for that computation to improve application-specific algorithms for multiplying sparse matrices. | |||||

BibTeX:
@inproceedings{BDKS15, author = {Ballard, Grey and Druinsky, Alex and Knight, Nicholas and Schwartz, Oded}, title = {Brief Announcement: Hypergraph Partitioning for Parallel Sparse Matrix-Matrix Multiplication}, booktitle = {Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures}, publisher = {ACM}, address = {New York, NY, USA}, month = {June}, year = {2015}, series = {SPAA '15}, pages = {86--88}, url = {http://doi.acm.org/10.1145/2755573.2755613} } |
|||||

G. Ballard, T. G. Kolda, A. Pinar and C. Seshadhri,
technical report}
| |||||

Abstract: Given two sets of vectors A and B, our problem is to find the top-t dot products among all possible pairs where one vector is taken from each set. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i,j) is proportional to the square of the dot product between vectors i (from A) and j (from B), amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods. | |||||

BibTeX:
@inproceedings{BKPS15, author = {Grey Ballard and Tamara G. Kolda and Ali Pinar and C. Seshadhri}, title = {Diamond Sampling for Approximate Maximum All-pairs Dot-product {(MAD)} Search}, booktitle = {15th IEEE International Conference on Data Mining}, publisher = {IEEE Computer Society}, month = {November}, year = {2015}, series = {ICDM '15}, pages = {11--20}, url = {http://dx.doi.org/10.1109/ICDM.2015.46} } |
|||||

G. Ballard, T. G. Kolda, A. Pinar and C. Seshadhri,
published version}
| |||||

Abstract: Given two sets of vectors A and B, our problem is to find the top-t dot products among all possible pairs where one vector is taken from each set. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i,j) is proportional to the square of the dot product between vectors i (from A) and j (from B), amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods. | |||||

BibTeX:
@techreport{BKPS15-TR, author = {Grey Ballard and Tamara G. Kolda and Ali Pinar and C. Seshadhri}, title = {Diamond Sampling for Approximate Maximum All-pairs Dot-product {(MAD)} Search}, month = {June}, year = {2015}, url = {http://arxiv.org/abs/1506.03872} } |
|||||

G. Ballard, C. Siefert and J. Hu,
published version}
| |||||

Abstract: We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient one for all of the matrix-matrix multiplications involved. By using an alternative algorithm, we show that the communication costs are reduced (in theory and practice), and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that we can reduce communication by a factor of up to 5.4 for a model problem, and we observe in our empirical evaluation communication reductions of factors up to 4.7 for structured problems and 3.7 for unstructured problems. These reductions in communication translate to run-time speedups of up to factors of 2.3 and 2.5, respectively. | |||||

BibTeX:
@techreport{BSH15-TR, author = {Ballard, Grey and Siefert, Christopher and Hu, Jonathan}, title = {Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid}, month = {September}, year = {2015}, number = {SAND2015-3275}, url = {http://prod.sandia.gov/techlib/access-control.cgi/2015/153275.pdf} } |
|||||

A. R. Benson and G. Ballard,
technical report}
| |||||

Abstract: Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical. | |||||

BibTeX:
@inproceedings{BB15, author = {Benson, Austin R. and Ballard, Grey}, title = {A Framework for Practical Parallel Fast Matrix Multiplication}, booktitle = {Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, address = {New York, NY, USA}, month = {February}, year = {2015}, series = {PPoPP 2015}, pages = {42--53}, url = {http://doi.acm.org/10.1145/2688500.2688513} } |
|||||

R. Kannan, G. Ballard and H. Park,
published version}
| |||||

Abstract: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A is approximated by WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithm to solve the problem for big datasets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices. We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for both dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. | |||||

BibTeX:
@techreport{KBP15-TR, author = {Ramakrishnan Kannan and Grey Ballard and Haesun Park}, title = {A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization}, month = {September}, year = {2015}, number = {1509.09313}, url = {http://arxiv.org/abs/1509.09313} } |
|||||

G. Ballard and J. Baxley,
| |||||

BibTeX:
@article{BB14, author = {Grey Ballard and John Baxley}, title = {Asymptotic Behavior of the Eigenvalues of Toeplitz Integral Operators Associated with the Hankel Transform}, journal = {Dynamic Systems and Applications}, month = {December}, year = {2014}, volume = {23}, number = {4}, pages = {505--530} } |
|||||

G. Ballard, D. Becker, J. Demmel, J. Dongarra, A. Druinsky, I. Peled, O. Schwartz, S. Toledo and I. Yamazaki,
technical report}
| |||||

Abstract: We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen's triangular tridiagonalization. It factors a dense symmetric matrix A as the product A=PLTL^TP^T, where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal. | |||||

BibTeX:
@article{BBDD+14, author = {Ballard, G. and Becker, D. and Demmel, J. and Dongarra, J. and Druinsky, A. and Peled, I. and Schwartz, O. and Toledo, S. and Yamazaki, I.}, title = {Communication-Avoiding Symmetric-Indefinite Factorization}, journal = {SIAM Journal on Matrix Analysis and Applications}, month = {November}, year = {2014}, volume = {35}, number = {4}, pages = {1364--1406}, url = {http://dx.doi.org/10.1137/130929060} } |
|||||

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight and O. Schwartz,
| |||||

Abstract: The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n^3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice. | |||||

BibTeX:
@article{BCDH+14, author = {Ballard, G. and Carson, E. and Demmel, J. and Hoemmen, M. and Knight, N. and Schwartz, O.}, title = {Communication lower bounds and optimal algorithms for numerical linear algebra}, journal = {Acta Numerica}, month = {May}, year = {2014}, volume = {23}, pages = {1--155}, url = {http://journals.cambridge.org/article_S0962492914000038} } |
|||||

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H. D. Nguyen and E. Solomonik,
technical report}
| |||||

Abstract: The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms. | |||||

BibTeX:
@inproceedings{BDGJ+14, author = {Ballard, Grey and Demmel, James and Grigori, Laura and Jacquelin, Mathias and Nguyen, Hong Diep and Solomonik, Edgar}, title = {Reconstructing {H}ouseholder Vectors from Tall-Skinny {QR}}, booktitle = {IEEE 28th International Parallel and Distributed Processing Symposium}, month = {May}, year = {2014}, pages = {1159--1170}, url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6877344} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
| |||||

Abstract: Algorithms have historically been evaluated in terms of the number of arithmetic operations they performed. This analysis is no longer sufficient for predicting running times on today's machines. Moving data through memory hierarchies and among processors requires much more time (and energy) than performing computations. Hardware trends suggest that the relative costs of this communication will only increase. Proving lower bounds on the communication of algorithms and finding algorithms that attain these bounds are therefore fundamental goals. We show that the communication cost of an algorithm is closely related to the graph expansion properties of its corresponding computation graph. Matrix multiplication is one of the most fundamental problems in scientific computing and in parallel computing. Applying expansion analysis to Strassen's and other fast matrix multiplication algorithms, we obtain the first lower bounds on their communication costs. These bounds show that the current sequential algorithms are optimal but that previous parallel algorithms communicate more than necessary. Our new parallelization of Strassen's algorithm is communication-optimal and outperforms all previous matrix multiplication algorithms. | |||||

BibTeX:
@article{BDHS14, author = {Ballard, Grey and Demmel, James and Holtz, Olga and Schwartz, Oded}, title = {Communication Costs of Strassen's Matrix Multiplication}, journal = {Communications of the ACM}, publisher = {ACM}, address = {New York, NY, USA}, month = {February}, year = {2014}, volume = {57}, number = {2}, pages = {107--114}, url = {http://doi.acm.org/10.1145/2556647.2556660} } |
|||||

A. Benson and G. Ballard,
published version}
| |||||

Abstract: Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical. | |||||

BibTeX:
@techreport{BB14-TR, author = {Benson, Austin and Ballard, Grey}, title = {A Framework for Practical Parallel Fast Matrix Multiplication}, month = {September}, year = {2014}, url = {http://arxiv.org/abs/1409.2908} } |
|||||

G. Ballard,
| |||||

Abstract: Dense linear algebra computations are essential to nearly every problem in scientific computing and to countless other fields. Most matrix computations enjoy a high computational intensity (i.e., ratio of computation to data), and therefore the algorithms for the computations have a potential for high efficiency. However, performance for many linear algebra algorithms is limited by the cost of moving data between processors on a parallel computer or throughout the memory hierarchy of a single processor, which we will refer to generally as communication. Technological trends indicate that algorithmic performance will become even more limited by communication in the future. In this thesis, we consider the fundamental computations within dense linear algebra and address the following question: can we significantly improve the current algorithms for these computations, in terms of the communication they require and their performance in practice? To answer the question, we analyze algorithms on sequential and parallel architectural models that are simple enough to determine coarse communication costs but accurate enough to predict performance of implementations on real hardware. For most of the computations, we prove lower bounds on the communication that any algorithm must perform. If an algorithm exists with communication costs that match the lower bounds (at least in an asymptotic sense), we call the algorithm communication optimal. In many cases, the most commonly used algorithms are not communication optimal, and we can develop new algorithms that require less data movement and attain the communication lower bounds. In this thesis, we develop both new communication lower bounds and new algorithms, tightening (and in many cases closing) the gap between best known lower bound and best known algorithm (or upper bound). We consider both sequential and parallel algorithms, and we asses both classical and fast algorithms (e.g., Strassen's matrix multiplication algorithm). In particular, the central contributions of this thesis are: proving new communication lower bounds for nearly all classical direct linear algebra computations (dense or sparse), including factorizations for solving linear systems, least squares problems, and eigenvalue and singular value problems; proving new communication lower bounds for Strassen's and other fast matrix multiplication algorithms; proving new parallel communication lower bounds for classical and fast computations that set limits on an algorithm's ability to perfectly strong scale; summarizing the state-of-the-art in communication efficiency for both sequential and parallel algorithms for the computations to which the lower bounds apply; developing a new communication-optimal algorithm for computing a symmetric-indefinite factorization (observing speedups of up to 2.8x compared to alternative shared-memory parallel algorithms); developing new, more communication-efficient algorithms for reducing a symmetric band matrix to tridiagonal form via orthogonal similar transformations (observing speedups of 2--6x compared to alternative sequential and parallel algorithms); and developing a new communication-optimal parallelization of Strassen's matrix multiplication algorithm (observing speedups of up to 2.84x compared to alternative distributed-memory parallel algorithms). | |||||

BibTeX:
@phdthesis{Ballard13, author = {Ballard, Grey}, title = {Avoiding Communication in Dense Linear Algebra}, month = {Aug}, year = {2013}, number = {UCB/EECS-2013-151}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-151.html} } |
|||||

G. Ballard, D. Becker, J. Demmel, J. Dongarra, A. Druinsky, I. Peled, O. Schwartz, S. Toledo and I. Yamazaki,
newer version}
{published version}
| |||||

BibTeX:
@techreport{BBDD+13-TR, author = {Ballard, G. and Becker, D. and Demmel, J. and Dongarra, J. and Druinsky, A. and Peled, I. and Schwartz, O. and Toledo, S. and Yamazaki, I.}, title = {Communication-Avoiding Symmetric-Indefinite Factorization}, month = {July}, year = {2013}, number = {UCB/EECS-2013-127}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-127.html} } |
|||||

G. Ballard, D. Becker, J. Demmel, J. Dongarra, A. Druinsky, I. Peled, O. Schwartz, S. Toledo and I. Yamazaki,
| |||||

Abstract: Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen's algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms. | |||||

BibTeX:
@inproceedings{BBDD+13, author = {Ballard, G. and Becker, D. and Demmel, J. and Dongarra, J. and Druinsky, A. and Peled, I. and Schwartz, O. and Toledo, S. and Yamazaki, I.}, title = {Implementing a Blocked {Aasen's} Algorithm with a Dynamic Scheduler on Multicore Architectures}, booktitle = {Proceedings of the 27th IEEE International Parallel Distributed Processing Symposium}, month = {May}, year = {2013}, series = {IPDPS '13}, pages = {895--907}, url = {http://dx.doi.org/10.1109/IPDPS.2013.98} } |
|||||

G. Ballard, A. Buluç, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz and S. Toledo,
| |||||

Abstract: Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdos-Renyi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal. | |||||

BibTeX:
@inproceedings{BBDG+13, author = {Ballard, G. and Bulu\c{c}, A. and Demmel, J. and Grigori, L. and Lipshitz, B. and Schwartz, O. and Toledo, S.}, title = {Communication optimal parallel multiplication of sparse random matrices}, booktitle = {Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures}, publisher = {ACM}, address = {Montr\'eal, Qu\'ebec, Canada}, month = {June}, year = {2013}, series = {SPAA '13}, pages = {222--231}, url = {http://doi.acm.org/10.1145/2486159.2486196} } |
|||||

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H. Nguyen and E. Solomonik,
published version}
| |||||

Abstract: The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. | |||||

BibTeX:
@techreport{BDGJ+13-TR, author = {Ballard, G. and Demmel, J. and Grigori, L. and Jacquelin, M. and Nguyen, H.D. and Solomonik, E.}, title = {Reconstructing {H}ouseholder Vectors from Tall-Skinny {QR}}, month = {Oct}, year = {2013}, number = {UCB/EECS-2013-175}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-175.html} } |
|||||

G. Ballard, J. Demmel and N. Knight,
published version}
| |||||

BibTeX:
@techreport{BDK13-TR, author = {Ballard, G. and Demmel, J. and Knight, N.}, title = {Avoiding Communication in Successive Band Reduction}, month = {July}, year = {2013}, number = {UCB/EECS-2013-131}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-131.html} } |
|||||

G. Ballard, J. Demmel, B. Lipshitz, O. Schwartz and S. Toledo,
| |||||

Abstract: High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation. | |||||

BibTeX:
@inproceedings{BDLST13, author = {Ballard, G. and Demmel, J. and Lipshitz, B. and Schwartz, O. and Toledo, S.}, title = {Communication efficient {Gaussian} elimination with partial pivoting using a shape morphing data layout}, booktitle = {Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures}, publisher = {ACM}, address = {Montr\'eal, Qu\'ebec, Canada}, month = {June}, year = {2013}, series = {SPAA '13}, pages = {232--240}, url = {http://doi.acm.org/10.1145/2486159.2486198} } |
|||||

G. Ballard, J. Demmel, O. Holtz, B. Lipshitz and O. Schwartz,
technical report}
| |||||

Abstract: A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms. | |||||

BibTeX:
@inproceedings{BDHLS12-SS, author = {G. Ballard and J. Demmel and O. Holtz and B. Lipshitz and O. Schwartz}, title = {Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds}, booktitle = {{Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures}}, publisher = {ACM}, address = {New York, NY, USA}, month = {June}, year = {2012}, series = {SPAA '12}, pages = {77--79}, url = {http://doi.acm.org/10.1145/2312005.2312021} } |
|||||

G. Ballard, J. Demmel, O. Holtz, B. Lipshitz and O. Schwartz,
| |||||

Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms. | |||||

BibTeX:
@inproceedings{BDHLS12-CAPS, author = {G. Ballard and J. Demmel and O. Holtz and B. Lipshitz and O. Schwartz}, title = {Communication-optimal parallel algorithm for {S}trassen's matrix multiplication}, booktitle = {{Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures}}, publisher = {ACM}, address = {New York, NY, USA}, month = {June}, year = {2012}, series = {SPAA '12}, pages = {193--204}, url = {http://doi.acm.org/10.1145/2312005.2312044} } |
|||||

G. Ballard, J. Demmel, O. Holtz, B. Lipshitz and O. Schwartz,
| |||||

Abstract: Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal. | |||||

BibTeX:
@incollection{BDHLS12-RECT, author = {Ballard, G. and Demmel, J. and Holtz, O. and Lipshitz, B. and Schwartz, O.}, title = {Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication}, booktitle = {Design and Analysis of Algorithms}, editor = {Even, G. and Rawitz, D.}, publisher = {Springer Berlin Heidelberg}, month = {December}, year = {2012}, series = {Lecture Notes in Computer Science}, volume = {7659}, pages = {13--36}, url = {http://dx.doi.org/10.1007/978-3-642-34862-4_2} } |
|||||

G. Ballard, J. Demmel, O. Holtz, B. Lipshitz and O. Schwartz,
published version}
| |||||

BibTeX:
@techreport{BDHLS12-SS-TR, author = {Ballard, G. and Demmel, J. and Holtz, O. and Lipshitz, B. and Schwartz, O.}, title = {Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds}, month = {March}, year = {2012}, number = {UCB/EECS-2012-31}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-31.html} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
older version}
| |||||

Abstract: The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size M, too small to store three n-by-n matrices, the lower bound on the number of words moved between fast and slow memory is, for a large class of matrix multiplication algorithms, (n/√ M)^0 ot M), where 0 is the exponent in the arithmetic count (e.g., 0 = lg 7 for Strassen, and 0 = 3 for conventional matrix multiplication). With p parallel processors, each with fast memory of size M, the lower bound is asymptotically lower by a factor of p. These bounds are attainable both for sequential and for parallel algorithms and hence optimal. | |||||

BibTeX:
@article{BDHS12, author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz}, title = {Graph expansion and communication costs of fast matrix multiplication}, journal = {Journal of the ACM}, publisher = {ACM}, address = {New York, NY, USA}, month = {December}, year = {2012}, volume = {59}, number = {6}, pages = {32:1--32:23}, url = {http://doi.acm.org/10.1145/2395116.2395121} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
| |||||

BibTeX:
@techreport{BDHS12-FastLA, author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz}, title = {Sequential Communication Bounds for Fast Linear Algebra}, month = {March}, year = {2012}, number = {UCB/EECS-2012-36}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-36.html} } |
|||||

G. Ballard, J. Demmel and N. Knight,
| |||||

Abstract: The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while doing 43% more floating point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library, comparing parallel performance on a ten-core processor. | |||||

BibTeX:
@inproceedings{BDK12, author = {G. Ballard and J. Demmel and N. Knight}, title = {Communication avoiding successive band reduction}, booktitle = {Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, address = {New York, NY, USA}, month = {February}, year = {2012}, series = {PPoPP '12}, pages = {35--44}, url = {http://doi.acm.org/10.1145/2145816.2145822} } |
|||||

B. Lipshitz, G. Ballard, J. Demmel and O. Schwartz,
| |||||

Abstract: Matrix multiplication is a fundamental kernel of many high performance and scientific computing applications. Most parallel implementations use classical O(n3) matrix multiplication, even though there exist algorithms with lower arithmetic complexity. We recently presented a new Communication-Avoiding Parallel Strassen algorithm (CAPS), based on Strassen's fast matrix multiplication, that minimizes communication (SPAA '12). It communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains theoretical lower bounds. In this paper we show that CAPS is also faster in practice. We benchmark and compare its performance to previous algorithms on Hopper (Cray XE6), Intrepid (IBM BG/P), and Franklin (Cray XT4). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of CAPS and predict its performance on future exascale platforms. | |||||

BibTeX:
@inproceedings{LBDS12, author = {Lipshitz, B. and Ballard, G. and Demmel, J. and Schwartz, O.}, title = {Communication-avoiding parallel {S}trassen: {I}mplementation and performance}, booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis}, address = {Salt Lake City, Utah}, month = {November}, year = {2012}, series = {SC '12}, pages = {101:1--101:11}, url = {http://dl.acm.org/citation.cfm?id=2388996.2389133} } |
|||||

M. Anderson, G. Ballard, J. Demmel and K. Keutzer,
| |||||

Abstract: We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs by up to 17x for tall-skinny matrices and Intel's Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel's Math Kernel Library (MKL) singular value decomposition routine on a multicore CPU. | |||||

BibTeX:
@inproceedings{ABDK11, author = {Anderson, M. and Ballard, G. and Demmel, J. and Keutzer, K.}, title = {Communication-Avoiding {QR} Decomposition for {GPU}s}, booktitle = {Proceedings of the 2011 IEEE International Parallel \& Distributed Processing Symposium}, publisher = {IEEE Computer Society}, address = {Washington, DC, USA}, month = {May}, year = {2011}, series = {{IPDPS} '11}, pages = {48--58}, url = {http://dx.doi.org/10.1109/IPDPS.2011.15} } |
|||||

G. Ballard, J. Demmel and I. Dumitriu,
| |||||

BibTeX:
@techreport{BDD11, author = {Ballard, G. and Demmel, J. and Dumitriu, I.}, title = {Communication-optimal parallel and sequential eigenvalue and singular value algorithms}, month = {February}, year = {2011}, number = {EECS-2011-14}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-14.html} } |
|||||

G. Ballard, J. Demmel and A. Gearhart,
| |||||

Abstract: As the gap between the cost of communication (i.e., data movement) and computation continues to grow, the importance of pursuing algorithms which minimize communication also increases. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present an algorithm for dense matrix multiplication which attains the lower bound. | |||||

BibTeX:
@inproceedings{BDG11, author = {Ballard, G. and Demmel, J. and Gearhart, A.}, title = {Brief announcement: communication bounds for heterogeneous architectures}, booktitle = {Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures}, publisher = {ACM}, address = {San Jose, California, USA}, month = {June}, year = {2011}, series = {SPAA '11}, pages = {257--258}, url = {http://doi.acm.org/10.1145/1989493.1989531} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
newer version}
| |||||

Abstract: The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal. | |||||

BibTeX:
@inproceedings{BDHS11-SLB, author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz}, title = {Graph expansion and communication costs of fast matrix multiplication}, booktitle = {{Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures}}, publisher = {ACM}, address = {San Jose, California, USA}, month = {June}, year = {2011}, series = {SPAA '11}, pages = {1--12}, url = {http://doi.acm.org/10.1145/1989493.1989495} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
| |||||

BibTeX:
@techreport{BDHS11-TR, author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz}, title = {Minimizing communication in linear algebra}, month = {February}, year = {2011}, number = {EECS-2011-15}, url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-15.html} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
| |||||

Abstract: In 1981 Hong and Kung proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo, and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as #arithmetic_operations/ √ M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDLT factorization, QR factorization, the Gram--Schmidt algorithm, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth cost), we get lower bounds on the number of messages required to move it (latency cost). We extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix) to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms that attain many of these lower bounds. | |||||

BibTeX:
@article{BDHS11-GLB, author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz}, title = {Minimizing Communication in Numerical Linear Algebra}, journal = {SIAM Journal on Matrix Analysis and Applications}, publisher = {SIAM}, month = {September}, year = {2011}, volume = {32}, number = {3}, pages = {866--901}, url = {http://epubs.siam.org/doi/abs/10.1137/090769156} } |
|||||

G. Ballard, T. Kolda and T. Plantenga,
technical report}
| |||||

Abstract: The tensor eigenproblem has many important applications, generating both mathematical and application-specific interest in the properties of tensor eigenpairs and methods for computing them. A tensor is an m-way array, generalizing the concept of a matrix (a 2-way array). Kolda and Mayo have recently introduced a generalization of the matrix power method for computing real-valued tensor eigenpairs of symmetric tensors. In this work, we present an efficient implementation of their algorithm, exploiting symmetry in order to save storage, data movement, and computation. For an application involving repeatedly solving the tensor eigenproblem for many small tensors, we describe how a GPU can be used to accelerate the computations. On an NVIDIA Tesla C 2050 (Fermi) GPU, we achieve 318 Gflops/s (31% of theoretical peak performance in single precision) on our test data set. | |||||

BibTeX:
@article{BKP11, author = {G. Ballard and T. Kolda and T. Plantenga}, title = {Efficiently Computing Tensor Eigenvalues on a {GPU}}, journal = {2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, month = {May}, year = {2011}, volume = {0}, pages = {1340--1348}, url = {http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2011.287} } |
|||||

J. Baxley and G. Ballard,
| |||||

Abstract: Motivated by a problem which arises in the analysis of stagnation point flow toward a stretching sheet, we consider a general class of problems of the form y'''=f(y,y',y''), y(0)=a, y'(0)=m, t→∞ y'(t) = b. We give conditions on f which imply existence of at least one solution and obtain a partial uniqueness result. | |||||

BibTeX:
@article{BB11, author = {Baxley, J. and Ballard, G.}, title = {Existence of Solutions for a Class of Singular Nonlinear Third Order Autonomous Boundary Value Problems}, journal = {Communications in Applied Analysis}, month = {Oct}, year = {2011}, volume = {15}, number = {2, 3, and 4}, pages = {195--202}, url = {http://www.dynamicpublishers.com/CAA/caa2011.htm} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
older version}
| |||||

Abstract: Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM. J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy. | |||||

BibTeX:
@article{BDHS10, author = {G. Ballard and J. Demmel and O. Holtz and O. Schwartz}, title = {Communication-optimal Parallel and Sequential {C}holesky Decomposition}, journal = {SIAM Journal on Scientific Computing}, publisher = {SIAM}, month = {December}, year = {2010}, volume = {32}, number = {6}, pages = {3495--3523}, url = {http://epubs.siam.org/doi/abs/10.1137/090760969} } |
|||||

G. Ballard, T. Kolda and T. Plantenga,
published version}
| |||||

BibTeX:
@techreport{BKP10-TR, author = {Ballard, G. and Kolda, T. and Plantenga, T.}, title = {Efficiently Computing Tensor Eigenvalues on a {GPU}}, month = {December}, year = {2010}, url = {http://csri.sandia.gov/Proceedings/} } |
|||||

G. Ballard and J. Baxley,
| |||||

Abstract: We study the Friedrichs extension for a class of 2nth order ordinary differential operators. These self-adjoint operators have compact inverses and the central problem is to describe their domains in terms of boundary conditions. | |||||

BibTeX:
@article{BB09, author = {Ballard, G. and Baxley, J.}, title = {The Friedrichs extension of certain singular differential operators, II}, journal = {Electronic Journal of Qualitative Theory of Differential Equations}, month = {Oct}, year = {2009}, volume = {Special Edition I}, number = {5}, pages = {1--11}, url = {http://www.emis.ams.org/journals/EJQTDE/sped1/105.html} } |
|||||

G. Ballard, J. Demmel, O. Holtz and O. Schwartz,
newer version}
| |||||

Abstract: Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky implementations to this lower bound, and draw the following conclusions: (1) "Nave" sequential algorithms for Cholesky attain neither the bandwidth nor latency lower bounds. (2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, AGW01, ST04], and one based on work of Toledo [Tol97], can attain the bandwidth lower bound. (3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot. (4) The recursive sequential algorithm due to [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cache-oblivious" way. (5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block-size) attains both the bandwidth and latency lower bounds to within a poly-logarithmic factor. Combined with prior results in [DGHL08a, DGHL08b, DGX08] this gives a complete set of communication-optimal algorithms for O(n^3) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU and QR by optimizing communication for any number of levels of memory hierarchy. | |||||

BibTeX:
@inproceedings{BDHS09, author = {Ballard, G. and Demmel, J. and Holtz, O. and Schwartz, O.}, title = {Communication-optimal parallel and sequential {C}holesky decomposition}, booktitle = {{Proceedings of the 22nd Symposium on Parallelism in Algorithms and Architectures}}, publisher = {ACM}, address = {Calgary, AB, Canada}, month = {June}, year = {2009}, series = {SPAA '09}, pages = {245--252}, url = {http://doi.acm.org/10.1145/1583991.1584054} } |
|||||

G. Ballard,
| |||||

BibTeX:
@mastersthesis{Ballard08, author = {Ballard, G.}, title = {Asymptotic behavior of the eigenvalues of Toeplitz integral operators associated with the Hankel transform}, month = {May}, year = {2008}, url = {http://wakespace.lib.wfu.edu/handle/10339/14676} } |
|||||

G. Ballard and J. Baxley,
| |||||

Abstract: We consider boundary value problems of the form y''=-f(t,y), y(0)=0, y(1)=0, motivated by examples where f(t,y)=t)g(y) and g(y) behave like y^-λ (0) as y→ 0^+. We explore conditions under which such problems have multiple positive solutions, investigate qualitative behavior of these solutions, and discuss computational methods for approximating the solutions. | |||||

BibTeX:
@article{BB08, author = {Ballard, G. and Baxley, J.}, title = {Qualitative behavior and computation of multiple solutions of singular nonlinear boundary value problems}, journal = {Involve}, month = {Feb}, year = {2008}, volume = {1}, number = {1}, pages = {21--31}, url = {http://msp.org/involve/2008/1-1/p03.xhtml} } |
|||||

G. Ballard,
| |||||

Abstract: In an interdisciplinary effort to model protein dependency networks, biologists measure signals from certain proteins within cells over a given interval of time. Using this time series data, the goal is to deduce protein dependency relationships. The mathematical challenge is to statistically measure correlations between given proteins over time in order to conjecture probable relationships. Biologists can then consider these relationships with more scrutiny, in order to confirm their conjectures. One algorithm for finding such relationships makes use of interpolation of the data to produce next-state functions for each protein and the Deegan-Packel Index of Power voting method to measure the strength of correlations between pairs of proteins. The algorithm was previously implemented, but limitations associated with the original language required the algorithm to be re-implemented in a more computationally efficient language. Because of the algebraic focus of the Computational Commutative Algebra language, or CoCoA, the algorithm was re-implemented in this language, and results have been produced much more efficiently. In this paper I discuss the algorithm, the CoCoA language, the implementation of the algorithm in CoCoA, and the quality of the results. | |||||

BibTeX:
@article{Ballard06, author = {Ballard, G.}, title = {Modeling Protein Dependency Networks Using CoCoA}, journal = {Crossroads}, publisher = {ACM}, address = {New York, NY, USA}, month = {September}, year = {2006}, volume = {13}, number = {1}, pages = {6--6}, url = {http://doi.acm.org/10.1145/1217666.1217672} } |
|||||

G. Ballard, J. Baxley and N. Libbus,
| |||||

Abstract: We consider nonlinear boundary value problems with multiple solutions. A method is proposed for the computation of such solutions which depends crucially on known a priori qualitative information about the behavior of the solutions. The method is a two-stage method where the second stage is a shooting method and initial values of the shooting parameters are found in a first stage which approximates the boundary value problem with a discrete approximation. Both nonsingular and singular problems are considered. | |||||

BibTeX:
@article{BBN06, author = {Ballard, G. and Baxley, J. and Libbus, N.}, title = {Qualitative behavior and computation of multiple solutions of nonlinear boundary value problems}, journal = {Communications on Pure and Applied Analysis}, month = {Mar}, year = {2006}, volume = {5}, number = {2}, pages = {251--259}, url = {http://www.aimsciences.org/journals/displayArticles.jsp?paperID=1769} } |