High Performance Tensor Computations

MS58 and MS69 at SIAM Annual Meeting 2017

The 2017 SIAM Annual Meeting took place July 10-14 in Pittsburgh, PA, USA. We organized a two-part minisymposium (MS58 and MS69) on tensor decompositions and high performance tensor computations; more information is available on the conference website and on the pages for the two sessions. The speakers have agreed to share their slides on this webpage.

Abstract. Tensor decompositions are increasingly useful in data science. We consider tensor decompositions such as CP and Tucker, as well as coupled decompositions. This minisymposium focuses on effective algorithms for data science problems and on high performance implementations of key tensor computations.

Organizers

Generalized Canonical Tensor Factorization for Binary, Count, Nonnegative, or Real-Valued Data

Tammy Kolda (Sandia National Laboratories)

Abstract. We consider the generalization of the canonical polyadic tensor factorization (also known as CANDECOMP, PARAFAC, or CP) to different goodness-of-fit metrics. The standard model measures fit via a squared error metric. We allow the user to generalize to other metrics that may be more appropriate for different data types, such as count, binary, or nonnegative data. This leads to a general framework for fitting the model. We show how the model can be extended to handle missing data and solved efficiently using randomized optimization methods.
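
As a rough illustration of the generalized objective described above, the following numpy sketch swaps the elementwise loss used in the CP fit and masks out missing entries. The function names, loss choices, and small example are our own illustrative assumptions, not the speaker's implementation.

# A minimal sketch of a generalized CP objective: the elementwise loss is a
# plug-in choice (Gaussian for real data, Poisson for counts, Bernoulli-logit
# for binary data); all names here are illustrative.
import numpy as np

def cp_full(factors):
    """Reconstruct a dense third-order tensor from CP factor matrices."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Elementwise losses f(x, m): x is the data entry, m the model entry.
losses = {
    'gaussian':  lambda x, m: (x - m) ** 2,                 # real-valued data
    'poisson':   lambda x, m: m - x * np.log(m + 1e-10),    # count data
    'bernoulli': lambda x, m: np.log1p(np.exp(m)) - x * m,  # binary data (logit link)
}

def gcp_objective(X, factors, loss='gaussian', mask=None):
    """Sum of elementwise losses; `mask` (0/1 array) models missing data."""
    M = cp_full(factors)
    F = losses[loss](X, M)
    if mask is not None:
        F = F * mask                       # ignore missing entries
    return F.sum()

# Usage: random count tensor, random nonnegative factors, Poisson loss.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(10, 12, 14)).astype(float)
factors = [rng.random((n, 3)) for n in X.shape]
print(gcp_objective(X, factors, loss='poisson'))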

Optimizing Distributed Tucker Decomposition

Venkat Chakaravarthy (IBM Corporation)

Abstract. The Tucker decomposition generalizes the classical Singular Value Decomposition (SVD) to higher dimensional tensors. The HOSVD/HOOI methods provide a popular mechanism for computing the Tucker decomposition of a given tensor. Prior work has presented distributed implementations of these methods for both dense and sparse tensors. We shall discuss optimization strategies for both settings with the aim of reducing the computational load and the communication volume. For the dense case, our strategies are provably optimal, and our experimental evaluation demonstrates that the strategies yield up to 7x performance gains over prior implementations on practical tensors.
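
For readers unfamiliar with HOSVD/HOOI, the following single-node numpy sketch shows the basic structure of the computation that the talk distributes and optimizes. It is a textbook-style illustration under our own naming, not the speakers' distributed implementation.

# A minimal dense HOSVD initialization followed by HOOI iterations for a
# third-order tensor; no distribution or communication optimization here.
import numpy as np

def unfold(T, mode):
    """Mode-n matricization of a dense tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def leading_singular_vectors(M, r):
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r]

def hooi(T, ranks, iters=10):
    # HOSVD: truncated SVD of each unfolding gives the initial factor matrices.
    U = [leading_singular_vectors(unfold(T, n), r) for n, r in enumerate(ranks)]
    for _ in range(iters):
        for n in range(T.ndim):
            # Project T onto all factors except mode n, then take the leading
            # singular vectors of the result's mode-n unfolding.
            Y = T
            for m in range(T.ndim):
                if m != n:
                    Y = np.tensordot(Y, U[m], axes=(m, 0))
                    Y = np.moveaxis(Y, -1, m)
            U[n] = leading_singular_vectors(unfold(Y, n), ranks[n])
    # Core tensor: project T onto all factor matrices.
    G = T
    for m in range(T.ndim):
        G = np.tensordot(G, U[m], axes=(m, 0))
        G = np.moveaxis(G, -1, m)
    return G, U

# Usage on a small random tensor.
rng = np.random.default_rng(0)
T = rng.standard_normal((20, 25, 30))
G, U = hooi(T, ranks=(5, 5, 5))
print(G.shape)   # (5, 5, 5)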

Decomposing Large-Scale Tensors into Rank-1 Terms Using Randomized Block Sampling

Nico Vervliet (KU Leuven)

Abstract. To interpret large-scale datasets, simple structures are often assumed. We focus on the decomposition of tensors into rank-1 terms, also called the (canonical) polyadic decomposition (CPD). Optimization-based algorithms used to compute the CPD are hindered by the curse of dimensionality. Many techniques based on, e.g., compression, incompleteness or parallel computation have been proposed to alleviate this curse.

In this talk, we demonstrate a new strategy based on randomized block sampling. In each iteration, a random block is sampled from the tensor, and a single update is computed from this block. Because of the locality property of a CPD, a block affects only a few variables, so updates can be computed efficiently. This process of sampling and updating is repeated until convergence, which can be determined using a novel stopping criterion based on the Cramér-Rao bound. We show how a carefully chosen step restriction schedule allows a tensor to be decomposed to almost the same accuracy as if a state-of-the-art algorithm for the full tensor were used, while using only a fraction of the data and time required by full-tensor algorithms. The step restriction schedule can be chosen experimentally or using a new automated procedure.

Finally, we show that, using randomized block sampling, a standard laptop suffices to decompose 8 TB synthetic tensors and real-life hazardous gas classification data within minutes, while achieving high accuracy.
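
To make the block-sampling idea concrete, here is a rough numpy sketch of one sampled update: draw a random subtensor block and refresh only the factor rows that block touches, with a simple fixed damping step. The step restriction schedule and the Cramér-Rao stopping criterion from the talk are omitted, and all names here are our own illustrative choices.

# Sample a random block, then perform one damped ALS-style update restricted
# to that block (third-order case).
import numpy as np

def sample_block(shape, block, rng):
    """Pick a random index subset of size block[n] in each mode."""
    return [rng.choice(n, size=b, replace=False) for n, b in zip(shape, block)]

def block_als_update(T, factors, idx, step=0.5):
    """Update only the factor rows touched by the sampled block."""
    sub = T[np.ix_(*idx)]                              # the sampled block
    for n in range(3):
        rows = idx[n]
        others = [factors[m][idx[m]] for m in range(3) if m != n]
        # Khatri-Rao product of the other (restricted) factor matrices.
        K = np.einsum('jr,kr->jkr', *others).reshape(-1, factors[n].shape[1])
        rhs = np.moveaxis(sub, n, 0).reshape(len(rows), -1)
        new_rows, *_ = np.linalg.lstsq(K, rhs.T, rcond=None)
        # Damped update: only the sampled rows of factor n change.
        factors[n][rows] = (1 - step) * factors[n][rows] + step * new_rows.T
    return factors

# Usage: one sampled block and one update on a random dense tensor.
rng = np.random.default_rng(1)
T = rng.random((60, 70, 80))
factors = [rng.random((n, 5)) for n in T.shape]
idx = sample_block(T.shape, block=(10, 10, 10), rng=rng)
factors = block_als_update(T, factors, idx)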

A Practical Randomized CP Tensor Decomposition

Casey Battaglino (Georgia Institute of Technology)

Abstract. The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show that the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri-Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization.
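
The following numpy sketch shows one way to sample the overdetermined least-squares problems inside CP-ALS: rather than forming the full Khatri-Rao product, each sampled equation multiplies one row from each of the other factor matrices. This is an illustrative variant under our own naming, with no preprocessing or mixing step, not the authors' code.

# Sampled least-squares update of one CP factor matrix.
import numpy as np

def sampled_ls_update(X, factors, mode, num_samples, rng):
    """Return an updated factor for `mode` from a sampled least-squares solve."""
    R = factors[mode].shape[1]
    other = [m for m in range(X.ndim) if m != mode]
    idx = [rng.integers(0, X.shape[m], size=num_samples) for m in other]
    # Sampled rows of the Khatri-Rao product: elementwise products of the
    # corresponding rows of the other factor matrices.
    Z = np.ones((num_samples, R))
    for m, ind in zip(other, idx):
        Z *= factors[m][ind]
    # Matching sampled columns of the mode-n unfolding of X.
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    cols = np.ravel_multi_index(idx, [X.shape[m] for m in other])
    rhs = Xn[:, cols]
    sol, *_ = np.linalg.lstsq(Z, rhs.T, rcond=None)   # Z @ A_mode^T ~= rhs^T
    return sol.T

# Usage: a few randomized ALS sweeps on a random dense tensor.
rng = np.random.default_rng(0)
X = rng.random((50, 60, 70))
factors = [rng.random((n, 8)) for n in X.shape]
for sweep in range(5):
    for mode in range(X.ndim):
        factors[mode] = sampled_ls_update(X, factors, mode, 2000, rng)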

Model-Driven Sparse CP Decomposition for Higher-Order Tensors

Jiajia Li (Georgia Institute of Technology)

Abstract. Given an input tensor, its CANDECOMP/PARAFAC decomposition (or CPD) is a low-rank representation. CPDs are of particular interest in data analysis and mining, especially when the data tensor is sparse and of higher order (dimension). This work focuses on the central bottleneck of a CPD algorithm, which is evaluating a sequence of matricized tensor times Khatri-Rao products (MTTKRPs). To speed up the MTTKRP sequence, we propose a novel, adaptive tensor memoization algorithm, AdaTM. Besides removing redundant computations within the MTTKRP sequence, which potentially reduces its overall asymptotic complexity, our technique also allows a user to make a space-time tradeoff by automatically tuning algorithmic and machine parameters using a model-driven framework. Our method improves as the tensor order grows, making its performance more scalable for higher-order data problems. We show speedups of up to 8× and 820× on real sparse data tensors with orders as high as 85 over the SPLATT package and the Tensor Toolbox library, respectively; and on a full CPD algorithm (CP-ALS), AdaTM can be up to 8× faster than the state-of-the-art method implemented in SPLATT.
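
For context, the kernel being accelerated here is the sparse MTTKRP. The baseline numpy sketch below shows that operation on a COO-format tensor; the adaptive memoization of partial products across the MTTKRP sequence, which is the contribution of AdaTM, is not reproduced.

# Baseline sparse MTTKRP on a COO tensor: for each nonzero, multiply the
# matching rows of all other factor matrices and scatter-add into the output.
import numpy as np

def sparse_mttkrp(indices, values, factors, mode):
    """MTTKRP for `mode`; `indices` is (nnz, N), `values` is (nnz,)."""
    N = len(factors)
    R = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], R))
    rows = values[:, None] * np.ones((len(values), R))
    for m in range(N):
        if m != mode:
            rows *= factors[m][indices[:, m]]
    np.add.at(out, indices[:, mode], rows)   # scatter-add per nonzero
    return out

# Usage on a small random sparse tensor.
rng = np.random.default_rng(0)
shape, nnz, R = (100, 120, 140), 5000, 16
indices = np.column_stack([rng.integers(0, s, nnz) for s in shape])
values = rng.random(nnz)
factors = [rng.random((s, R)) for s in shape]
M0 = sparse_mttkrp(indices, values, factors, mode=0)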

Tensor Analysis: Applications and Algorithms

Christos Faloutsos (Carnegie Mellon University)

Abstract. Can tensors help us understand fMRI brainscans? How about computer network intrusion detection? We focus on the multiple applications that tensors can help us analyze, including co-evolving time sequences, power-grid measurements, and more. We also present a recent algorithm for higher-order tensors (order 4 or higher) that achieves several orders of magnitude savings in space and time, and the 'groupNteach' algorithm, which connects education to matrix algebra and information theory.

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Chris Forster (Sandia National Laboratories)

Abstract. Tensors have found utility in a wide range of applications, such as chemometrics, network traffic analysis, neuroscience, and signal processing. Many of these applications have increasingly large amounts of data to process and require high-performance methods to provide a reasonable turnaround time for analysts. In this work, we consider decomposition of sparse count data using CANDECOMP-PARAFAC alternating Poisson regression (CP-APR) with both multiplicative update and quasi-Newton methods. For these methods to remain effective on modern large-core-count CPU, Many Integrated Core (MIC), and Graphics Processing Unit (GPU) architectures, it is essential to expose thread- and vector-level parallelism and to take into account the memory hierarchy and access patterns of each device. In this presentation, we will discuss the optimization and observed performance of the methods on modern high-performance computing architectures using the Kokkos programming model, the overhead incurred by portability, and implications for upcoming distributed solver development.
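
To show the shape of the computation being ported, here is a small dense-numpy sketch of the core CP-APR multiplicative update for count data. The sparse data structures, inner-iteration logic, scaling details, and Kokkos-based parallelization discussed in the talk are all omitted, and the helper names are our own.

# One multiplicative-update sweep of CP-APR over all modes of a count tensor.
import numpy as np

def khatri_rao(mats):
    """Column-wise Khatri-Rao product of a list of matrices."""
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, R)
    return out

def cp_apr_mu_sweep(X, factors, eps=1e-10):
    """Multiplicative Poisson-regression update for each factor in turn."""
    N = X.ndim
    for n in range(N):
        Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
        P = khatri_rao([factors[m] for m in range(N) if m != n])
        M = factors[n] @ P.T                       # current model unfolding
        Phi = (Xn / np.maximum(M, eps)) @ P        # Poisson MM step
        factors[n] = factors[n] * Phi              # multiplicative update
    return factors

# Usage: a few sweeps on a random Poisson tensor.
rng = np.random.default_rng(0)
X = rng.poisson(1.5, size=(30, 40, 50)).astype(float)
factors = [rng.random((s, 8)) + 0.1 for s in X.shape]
for it in range(20):
    factors = cp_apr_mu_sweep(X, factors)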

Blocking Optimization Strategies for Sparse Tensor Computation

Jee Choi (IBM T.J. Watson Research Center)

Abstract. Many social and scientific domains give rise to data with multi-way relationships that can naturally be represented by tensors, or multi-dimensional arrays. Decomposing, or factoring, tensors can reveal latent properties that are otherwise difficult to see. However, due to the relatively recent rise in popularity of tensor decomposition in HPC, its performance optimization challenges are poorly understood. In this presentation, I will explain the steps taken to identify and isolate the major bottlenecks in tensor decomposition algorithms, and demonstrate significant speedups over the prior state of the art using various cache blocking mechanisms.
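
As a toy illustration of one cache-blocking idea for sparse tensor kernels such as the MTTKRP sketched earlier, the snippet below reorders COO nonzeros so that each block touches only a small range of factor-matrix rows. The actual blocking schemes and data structures from the talk are more involved; this is illustrative only, and the block size is an arbitrary assumption.

# Reorder nonzeros by block coordinates so each block's factor-row accesses
# stay within a cache-sized range.
import numpy as np

def block_reorder(indices, values, block_size=32):
    """Sort COO nonzeros lexicographically by their per-mode block coordinates."""
    block_ids = indices // block_size                 # block coordinate per mode
    # np.lexsort uses the last key as the primary key, so reverse the modes to
    # make mode 0 the primary sort key.
    keys = tuple(block_ids[:, m] for m in reversed(range(indices.shape[1])))
    order = np.lexsort(keys)
    return indices[order], values[order]

# Usage on a random sparse tensor in COO format.
rng = np.random.default_rng(0)
shape, nnz = (1000, 1200, 1400), 100_000
indices = np.column_stack([rng.integers(0, s, nnz) for s in shape])
values = rng.random(nnz)
indices, values = block_reorder(indices, values)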