Parallel Algorithms for Tensor Computations and their Applications

SIAM Conference on Parallel Processing for Scientific Computing 2022

The SIAM PP22 conference took place virtually, February 23-26, 2022. Grey Ballard, Drew Lewis, Jeremy Myers, and Eric Phipps organized a three-part minisymposium on recent advances in algorithms for tensor decomposition, their parallel implementations, and their applications. The conference website is here. The speakers have agreed to share their slides on this webpage.

Abstract. Tensors, or multidimensional arrays, are a natural way to represent high-dimensional data arising in a multitude of applications. Tensor decompositions, such as the CANDECOMP/PARAFAC, Tucker, and Tensor Train models, help to identify latent structure, achieve data compression, and enable other tools of scientific and data analysis. This minisymposium explores recent advances in algorithms for computing tensor decompositions, parallel algorithms for computing key tensor decomposition kernels, and applications of these methods to scientific and data analysis use-cases.

Organizers

Grey Ballard (Wake Forest University), Drew Lewis, Jeremy Myers (College of William & Mary), and Eric Phipps (Sandia National Laboratories)

A Distributed Memory Implementation of CP-POPT-GDGN: An All-at-Once Decomposition Algorithm for Count Tensors

Teresa Ranadive (Laboratory for Physical Sciences)

Abstract. Data sets from many applications can be represented as count tensors. As such tensors grow larger and increasingly elaborate, more sophisticated algorithms are required to enable accurate CP decompositions of these tensors. Recent work has demonstrated that CP-POPT-GDGN, an all-at-once algorithm for count tensor decomposition that utilizes a generalized damped Gauss-Newton method, typically achieves more accurate decompositions than earlier count tensor decomposition algorithms, including CP-APR. In this work, we consider a distributed memory implementation of CP-POPT-GDGN, and discuss its scalability. This implementation allows us to compute CP decompositions in an amount of time comparable to that required by a distributed memory implementation of CP-APR, when using the same computing resources.
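
For context, CP-APR and all-at-once methods such as CP-POPT-GDGN both fit the CP model to a count tensor by (approximately) minimizing the Poisson negative log-likelihood. In a standard formulation (not taken verbatim from the talk), one minimizes sum_i (m_i - x_i log m_i) over the factor matrices, where x_i are the entries of the count tensor and m_i are the corresponding entries of the low-rank CP model; an all-at-once method updates all factor matrices simultaneously with a (damped) Gauss-Newton step rather than cycling through them one at a time.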

A New Alternating Optimization Algorithm for CP Decomposition

Navjot Singh (University of Illinois at Urbana-Champaign)

Abstract. CP decomposition (CPD) is prevalent in chemometrics, signal processing, data mining, and many other fields. While a plethora of algorithms have been proposed to compute the CPD, alternating least squares (ALS) remains one of the most widely used. Recent works have shown that ALS suffers from slow convergence when the CP factors become collinear (also known as the swamp phenomenon). Several attempts have been made to overcome this issue by incorporating techniques such as line search and pivoting into the ALS algorithm. We propose a new algorithm that minimizes a different objective function via alternating optimization, leading to a sequence of better-conditioned iterates of the CP factors and alleviating the swamp phenomenon. Alternating optimization of this new objective leads to simple updates to the factor matrices with the same asymptotic computational cost as ALS. We show that a subsweep of this algorithm can achieve a superlinear convergence rate for CPD with known rank and verify it experimentally. We further introduce a formulation that generalizes our approach to interpolate between the ALS updates and those of the new algorithm, managing the tradeoff between stability and fitness of the decomposition for approximation problems. Our experimental results show that for approximate CPD, this algorithm and its variants converge to a better-conditioned decomposition with only a small change in fitness compared to ALS.
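
For readers unfamiliar with the baseline, the sketch below is a minimal dense CP-ALS loop in Python/NumPy, i.e., the standard algorithm the talk improves upon rather than the new alternating optimization method itself; the function names, the 3-way assumption, and the unfolding convention are illustrative choices.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: rows indexed by the chosen mode, C-ordered columns.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    # Column-wise Khatri-Rao product; the row index of A varies slowest.
    r = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, r)

def cp_als(T, rank, iters=50, seed=0):
    # T is a dense 3-way numpy array.
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in T.shape]
    for _ in range(iters):
        for n in range(3):
            others = [factors[m] for m in range(3) if m != n]
            kr = khatri_rao(others[0], others[1])
            # Hadamard product of the Gram matrices of the other factors.
            gram = np.ones((rank, rank))
            for m in range(3):
                if m != n:
                    gram *= factors[m].T @ factors[m]
            # Least-squares update for the mode-n factor.
            factors[n] = unfold(T, n) @ kr @ np.linalg.pinv(gram)
    return factors
```

Each inner update is a linear least-squares solve against the Khatri-Rao product of the other factors; when the factors become collinear these solves are ill-conditioned, which is the swamp behaviour the new objective is designed to alleviate.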

Fully Constrained PARAFAC2 with AO-ADMM

Marie Roald (Simula Metropolitan Center for Digital Engineering and Oslo Metropolitan University)

Abstract. The PARAFAC2 model is a powerful method to analyse multiway measurements with variations across one mode. The model captures latent factors and allows the factors of one mode to evolve across another, which has proven beneficial for applications in several domains, such as chemometrics, neuroscience and data mining. For example, the PARAFAC2 model can accurately describe chromatographic data despite retention shifts and shape changes of the temporal profiles across samples. A challenge with PARAFAC2, however, is the constant cross-product constraint, which is necessary to ensure a unique decomposition. To impose this constraint, the evolving factor matrices are traditionally modelled implicitly as the product of a shared “coordinate matrix” and a set of orthogonal basis matrices. However, this implicit handling makes it difficult to impose constraints on the evolving factor matrices. In this work, we instead reformulate the regularised PARAFAC2 problem in a form suited for alternating optimisation (AO) with the alternating direction method of multipliers (ADMM), thus making it possible to fit PARAFAC2 models with any proximable constraint on all modes. Using simulated datasets, we show that the proposed scheme is accurate, flexible in terms of constraints and computationally efficient. Finally, we use our method in a real-world chromatography application and show that constraining all modes to be non-negative improves the interpretability of the extracted components.
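
Concretely, for a collection of matrices X_k, the PARAFAC2 model is X_k ≈ A diag(c_k) B_k^T, with the cross-product B_k^T B_k required to be the same for every k. The traditional implicit handling mentioned above writes B_k = P_k B, where each P_k has orthonormal columns and B is the shared coordinate matrix; this is the standard formulation, and the talk's contribution is the AO-ADMM scheme that lets constraints be imposed on the B_k directly.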

Practical, Out-of-Memory Sparse MTTKRP on Massively Parallel Architectures

Andy Nguyen (University of Oregon)

Abstract. We propose a novel framework for massively parallel execution of fundamental tensor decomposition operations. Our new Blocked Linearized CoOrdinate (BLCO) format enables out-of-memory computation of the matricized tensor times Khatri-Rao product (MTTKRP) for every tensor dimension, using a unified implementation that works on a single tensor copy. To address the substantial synchronization cost on massively parallel architectures, we introduce a hierarchical on-the-fly algorithm to efficiently discover and resolve update conflicts across threads without keeping any auxiliary information or storing non-zero elements according to specific mode orientations. As a result, our BLCO-based MTTKRP not only handles out-of-memory tensors, but also delivers superior in-memory performance compared to mode-specific approaches. On the latest NVIDIA A100 GPU, BLCO achieves a 2.9× geometric-mean speedup over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) on a range of real-world sparse tensors.
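
To make the kernel concrete, the following is a minimal Python/NumPy sketch of MTTKRP for a sparse tensor stored in plain COO form; it is a baseline illustration only and does not reflect the BLCO format or the conflict-resolution algorithm described above.

```python
import numpy as np

def mttkrp_coo(indices, values, factors, mode):
    """Baseline MTTKRP for a sparse COO tensor.

    indices : (nnz, ndim) integer array of nonzero coordinates
    values  : (nnz,) nonzero values
    factors : list of factor matrices; factors[n] has shape (dim_n, rank)
    mode    : the mode whose factor is being updated
    """
    rank = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], rank))
    for x, idx in zip(values, indices):
        # Hadamard product of the factor rows from every other mode.
        row = np.full(rank, x)
        for n, i in enumerate(idx):
            if n != mode:
                row *= factors[n][i]
        # Scatter-add into the output row.
        out[idx[mode]] += row
    return out
```

The scatter-add into out[idx[mode]] is where parallel threads collide on shared output rows; resolving those conflicts efficiently, for any mode and for tensors larger than device memory, is the problem the BLCO work targets.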

Multi-Start Poisson Tensor Factorization

Jeremy Myers (College of William & Mary)

Abstract. Poisson Tensor Factorization (PTF) is a special case of the CP decomposition that fits low-rank, multilinear structure via Poisson maximum likelihood estimation. Owing to the nonlinearity and non-convexity of the underlying problem, it is important to use a multi-start strategy to improve the chance of recovering the maximum likelihood estimator rather than suboptimal local solutions. We compare scalability-accuracy trade-offs among state-of-the-art methods for solving the PTF problem on real-world problems in the multi-start context.
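
A multi-start driver is simple to state: run a single-start PTF solver from several random initializations and keep the factorization with the highest Poisson log-likelihood. The Python sketch below assumes a hypothetical solver interface fit_ptf and is only an illustration of the strategy, not of any particular method compared in the talk.

```python
import numpy as np

def multi_start(fit_ptf, tensor, rank, n_starts=20, seed=0):
    # fit_ptf is any single-start PTF solver (hypothetical interface)
    # returning (factors, log_likelihood) for a given initialization.
    rng = np.random.default_rng(seed)
    best_factors, best_ll = None, -np.inf
    for _ in range(n_starts):
        init = [rng.random((dim, rank)) for dim in tensor.shape]
        factors, ll = fit_ptf(tensor, rank, init=init)
        if ll > best_ll:
            best_factors, best_ll = factors, ll
    return best_factors, best_ll
```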

Parallel Randomized Algorithms for Tucker Decompositions

Rachel Minster (Wake Forest University)

Abstract. The Tucker decomposition is a low-rank tensor decomposition generalizing the matrix SVD to higher dimensions. Traditional algorithms for computing Tucker decompositions can be computationally expensive, as they involve computing the SVD of each mode unfolding, which can be quite large. Existing parallel and randomized algorithms have reduced this cost, but computational challenges remain. We propose new parallel randomized algorithms to address these challenges. Using randomized matrix techniques, we accelerate a distributed-memory implementation of the Tucker decomposition to obtain an efficient algorithm that avoids communication. Specifically, we employ a new sketching method that exploits Kronecker structure to accelerate a key computation. We also present a probabilistic analysis of the error resulting from the algorithm, as well as numerical results demonstrating the computational benefits of this approach.
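
The Python/NumPy sketch below illustrates the general idea of a Kronecker-structured sketch for one mode of a 3-way tensor: small random matrices are contracted with the other modes, which is mathematically equivalent to multiplying the mode unfolding by a large Kronecker-product random matrix without ever forming it. This is a single-node illustration under assumed conventions, not the distributed algorithm or the specific sketch from the talk.

```python
import numpy as np

def kron_sketch_mode0(X, sketch_sizes, seed=0):
    # X is a dense 3-way numpy array; sketch_sizes = (s1, s2).
    rng = np.random.default_rng(seed)
    I0, I1, I2 = X.shape
    s1, s2 = sketch_sizes
    Om1 = rng.standard_normal((I1, s1)) / np.sqrt(s1)
    Om2 = rng.standard_normal((I2, s2)) / np.sqrt(s2)
    # Contracting modes 1 and 2 with small random matrices is equivalent to
    # multiplying the mode-0 unfolding by the Kronecker product Om1 (x) Om2,
    # without ever forming that large matrix.
    Y = np.einsum('ijk,ja,kb->iab', X, Om1, Om2).reshape(I0, s1 * s2)
    # Orthonormal basis for the (approximate) range of the unfolding,
    # i.e., a candidate mode-0 factor for a randomized Tucker decomposition.
    Q, _ = np.linalg.qr(Y)
    return Q
```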

FIST-HOSVD: Fused Inplace Sequentially Truncated Higher Order Singular Value Decomposition

Benjamin Cobb (Georgia Institute of Technology)

Abstract. Several novel methods for improving the memory locality of the Sequentially Truncated Higher Order Singular Value Decomposition (ST-HOSVD) algorithm for computing the Tucker decomposition are presented. We show how the two primary computational bottlenecks of the ST-HOSVD can be fused together into a single kernel to improve memory locality. We then extend matrix tiling techniques to tensors to further improve cache utilization. This block-based approach is then leveraged to drastically reduce the auxiliary memory requirements of the algorithm. Our approach's effectiveness is demonstrated by comparing benchmark results between the traditional ST-HOSVD kernels and our single fused kernel. We then compare single-node CPU runtime results of an ST-HOSVD implementation utilizing our optimizations to TuckerMPI. We demonstrate a ∼2× speedup over the existing state of the art for dense, high-rank tensors, whilst increasing the problem size that can be computed for a given memory allocation by ∼3×. Finally, we demonstrate our approach's effectiveness with a motivating example of compressing Homogeneous Charge Compression Ignition (HCCI) and Stat Planar (SP) simulation datasets.
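
As background, a basic (unfused) ST-HOSVD loop over the modes looks roughly like the Python/NumPy sketch below; the Gram-matrix formation and the tensor-times-matrix (TTM) in each iteration are the two kernels that the talk fuses, and none of the tiling or in-place optimizations are shown.

```python
import numpy as np

def st_hosvd(X, ranks):
    # X is a dense numpy array; ranks[n] is the target rank for mode n.
    core = X
    factors = []
    for mode, r in enumerate(ranks):
        # Kernel 1: Gram matrix of the current mode unfolding,
        # followed by its leading eigenvectors.
        unf = np.moveaxis(core, mode, 0).reshape(core.shape[mode], -1)
        gram = unf @ unf.T
        _, V = np.linalg.eigh(gram)          # eigenvalues in ascending order
        U = V[:, ::-1][:, :r]                # leading r eigenvectors
        factors.append(U)
        # Kernel 2: tensor-times-matrix (TTM) that truncates this mode to rank r.
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1),
                           0, mode)
    return core, factors
```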

Parallel Tensor Train Rounding using Gram SVD

Grey Ballard (Wake Forest University)

Abstract. Tensor Train (TT) is a low-rank tensor representation consisting of a series of three-way cores whose dimensions specify the TT ranks. Formal tensor train arithmetic often causes an artificial increase in the TT ranks. Thus, a key operation for applications that use the TT format is rounding, which truncates the TT ranks subject to an approximation error guarantee. Truncation is performed via SVD of a highly structured matrix, and current rounding methods require careful orthogonalization to compute an accurate SVD. We propose a new algorithm for TT rounding based on the Gram SVD algorithm that avoids the expensive orthogonalization phase. Our algorithm performs less computation and can be parallelized more easily than existing approaches, at the expense of a slight loss of accuracy. We demonstrate that our implementation of the rounding algorithm is efficient, scales well, and consistently outperforms the existing state-of-the-art parallel implementation in our experiments.
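
The Gram-SVD idea can be illustrated on a single matrix: instead of orthogonalizing (QR-factoring) the matrix before truncation, one forms the small Gram matrix and eigendecomposes it. A minimal Python/NumPy sketch of that building block (not the parallel TT rounding algorithm itself, which applies the idea to the structured unfoldings of the TT cores) might look like:

```python
import numpy as np

def gram_svd_truncate(A, tol):
    # Truncated left singular vectors of A via its Gram matrix, avoiding a
    # QR/orthogonalization pass; assumes A has at least as many columns as rows.
    # Trade-off: squaring A roughly squares its condition number, which is the
    # slight loss of accuracy mentioned above.
    G = A @ A.T                                # small Gram matrix
    w, V = np.linalg.eigh(G)                   # eigenvalues in ascending order
    w, V = w[::-1], V[:, ::-1]
    sigma = np.sqrt(np.maximum(w, 0.0))        # singular values of A
    # Smallest rank whose discarded tail meets the Frobenius-norm tolerance.
    tail = np.concatenate([np.cumsum((sigma ** 2)[::-1])[::-1], [0.0]])
    rank = int(np.argmax(tail <= tol ** 2))
    return V[:, :rank], sigma[:rank]
```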

Memory-Efficient Tensorized Embedding Layers for Neural Networks

Chunxing Yin (Georgia Institute of Technology)

Abstract. Tensor-trains have been proposed to reduce the memory requirements of embedding tables in industrial-scale deep neural networks, including recommendation models and graph neural networks. This talk summarizes tensor-train use cases and implementation challenges when training large embeddings for such models. In this context, we are developing a new prototype infrastructure of performance-optimized kernels, TT-EmbeddingBag, that aims to reduce model sizes while preserving model accuracy and incurring only modest increases in training time over uncompressed baselines.
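
To illustrate how such a compressed table is used, the Python/NumPy sketch below reconstructs a single embedding row from TT cores in the common TT-matrix layout, where both the row and column dimensions of the table are factored across cores. The core shapes and index convention are illustrative assumptions, not the TT-EmbeddingBag kernels themselves.

```python
import numpy as np

def tt_embedding_lookup(cores, row_shape, col_shape, row_index):
    """Reconstruct one row of a TT-compressed embedding table.

    cores[k] has shape (r_{k-1}, row_shape[k], col_shape[k], r_k) with
    r_0 = r_d = 1, so the full table has prod(row_shape) rows and
    prod(col_shape) columns.
    """
    # Factor the flat row index into one index per core (mixed radix, row-major).
    idx = []
    for dim in reversed(row_shape):
        idx.append(row_index % dim)
        row_index //= dim
    idx = idx[::-1]
    # Multiply the selected core slices together, merging column modes as we go.
    out = cores[0][:, idx[0], :, :]                  # shape (1, J_0, r_1)
    for k in range(1, len(cores)):
        slice_k = cores[k][:, idx[k], :, :]          # shape (r_k, J_k, r_{k+1})
        out = np.einsum('ajr,rkb->ajkb', out, slice_k)
        out = out.reshape(1, -1, slice_k.shape[-1])
    return out.reshape(-1)                           # embedding vector
```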

Robust Approximation of Tensor Networks, and Its Applications in Quantum Chemistry

Edward F. Valeev (Virginia Tech)

Abstract. Approximation of a tensor network by approximating (e.g., factorizing) one or more of its constituent tensors can be improved by canceling the leading-order error due to the constituents' approximation. The utility of such robust approximation is demonstrated for robust canonical polyadic (CP) approximation of a (density-fitting) factorized 2-particle Coulomb interaction tensor. The resulting algebraic (grid-free) approximation for the Coulomb tensor, closely related to the factorization appearing in pseudospectral and tensor hypercontraction approaches, is efficient and accurate, with significantly reduced rank compared to the naive (non-robust) approximation. Application of the robust approximation to the particle-particle ladder term in the coupled-cluster singles and doubles method reduces the size complexity from O(N^6) to O(N^5), with robustness ensuring negligible errors in chemically relevant energy differences using CP ranks approximately equal to the size of the density-fitting basis. Preliminary applications to other electronic structure methods will also be discussed.
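
The error-cancellation idea can be stated for a simple product of two approximated factors: if A ≈ Ã and B ≈ B̃, then the robust combination ÃB + AB̃ − ÃB̃ differs from the exact product AB only by (A − Ã)(B − B̃), so the first-order errors of both constituents cancel and only a second-order cross term remains. (This is the standard robust-fitting identity, written here for matrices as a simplified stand-in for the tensor-network setting of the talk.)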

Structured Matrix Approximations via Tensor Decompositions

Arvind Saibaba (North Carolina State University)

Abstract. We provide a computational framework for approximating a class of structured matrices (e.g., block Toeplitz, block banded). Our approach has three steps: map the structured matrix to tensors, use tensor compression algorithms, and map the compressed tensors back to obtain two different matrix representations: a sum of Kronecker products and a block low-rank format. The use of tensor decompositions enables us to uncover latent structure in the matrices and leads to computationally efficient algorithms. The resulting matrix approximations are memory efficient, easy to compute with, and preserve the error due to the tensor compression in the Frobenius norm. While our framework is quite general, we illustrate the potential of our method on structured matrices from three applications: system identification, space-time covariance matrices, and test matrices from the SuiteSparse collection. See preprint: https://arxiv.org/abs/2105.01170.
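
One concrete instance of the map-compress-map-back pipeline is the classical rearrangement trick for sum-of-Kronecker-products approximation (Van Loan-Pitsianis): each block of the matrix becomes a row of a rearranged matrix, and a low-rank decomposition of that rearranged matrix yields the Kronecker factors. The Python/NumPy sketch below uses a truncated SVD for the compression step and illustrates the general idea only; it is not necessarily the exact algorithm in the preprint.

```python
import numpy as np

def kron_sum_approx(A, block_shape, terms):
    # Approximate A (shape (m1*m2, n1*n2)) by a sum of Kronecker products
    # kron(B_k, C_k), k = 1..terms, via rearrangement plus truncated SVD.
    m2, n2 = block_shape                      # size of the small (C) blocks
    m1, n1 = A.shape[0] // m2, A.shape[1] // n2
    # Rearrange: each (m2 x n2) block of A becomes one row of R.
    R = (A.reshape(m1, m2, n1, n2)
           .transpose(0, 2, 1, 3)
           .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    approx = np.zeros(A.shape)
    for k in range(terms):
        B = (np.sqrt(s[k]) * U[:, k]).reshape(m1, n1)
        C = (np.sqrt(s[k]) * Vt[k]).reshape(m2, n2)
        approx += np.kron(B, C)
    return approx
```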

Parallel Memory-Efficient Computation of Symmetric Higher-Order Joint Moment Tensors

Zitong Li (Wake Forest University)

Abstract. The decomposition of the higher-order joint cumulant tensor of spatio-temporal data sets is useful for analyzing multivariate non-Gaussian statistics, with a wide variety of applications (e.g., anomaly detection, independent component analysis, dimensionality reduction). Computing the cumulant tensor often requires first computing the joint moment tensor of the input data, which is very expensive using a naïve algorithm. The current state-of-the-art algorithm for computing joint moment tensors takes advantage of the symmetric nature of a moment tensor by dividing it into smaller cubic tensor blocks and computing only the blocks with unique values, thus reducing computation. We propose an improvement over this algorithm by posing its computation as matrix operations, specifically Khatri-Rao products and standard matrix multiplications. Because this approach is much more cache efficient, we expect considerable speedup on a single processor. We implemented our algorithm in Julia and in Matlab and compared it against the state-of-the-art approach. The results show a speedup of up to 10x.
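
As a small illustration of the matrix-based formulation (for the full third-order moment tensor, without the symmetric blocking described above), the moment computation can be written as a row-wise Khatri-Rao product followed by a single matrix multiplication. A minimal Python/NumPy sketch, under an assumed samples-by-variables data layout, is:

```python
import numpy as np

def third_moment(X):
    # Third-order joint moment tensor of a data matrix X (samples x variables),
    # computed with a row-wise Khatri-Rao product and one matrix multiply
    # instead of an explicit loop over samples.
    n, d = X.shape
    # Row-wise Khatri-Rao product: Z[t, j*d + k] = X[t, j] * X[t, k]
    Z = (X[:, :, None] * X[:, None, :]).reshape(n, d * d)
    # Mode-1 unfolding of the moment tensor: M_(1) = X^T Z / n
    M_unfolded = X.T @ Z / n
    return M_unfolded.reshape(d, d, d)
```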