Tensors are multidimensional arrays and can be used to represent data involving multiway relationships. For example, data sets that represent items and features that vary over time are naturally represented as a three-way tensor; data from physical experiments or simulations that track multiple variables over space and time are also natural higher-order tensors. Tensor factorization is an elemental form of unsupervised machine learning that decomposes data into latent factors that are often directly interpretable or can be used as input to subsequent analysis.
The size and complexity of today’s data sets are making traditional data analytics techniques, which are typically designed for single processors or workstations, infeasible in terms of memory and computation time. The machine learning and data mining communities are beginning to require high performance computing techniques to solve their problems of interest. In particular, while mathematical approaches for tensor factorizations have been developing over the past several years, research into the efficiency of fundamental tensor computations is only recently garnering attention. The goal of this project is to leverage the wealth of algorithmic techniques and implementations developed in the field of high performance linear algebra and apply that expertise to tensor computations, enabling data analytics of cutting-edge multidimensional data.
Strassen surprised the field of computer science in 1969 with a simple recursive algorithm for dense matrix multiplication that requires asymptotically fewer arithmetic operations than the traditional approach. For multiplying two square matrices of dimension n, his algorithm requires O(n^{2.81}) operations, while the traditional dot-product based algorithm requires O(n^{3}). His discovery sparked decades of research into “fast” matrix multiplication algorithms in the fields of pure mathematics, theoretical computer science, and high performance computing that continues today. There exists a theoretical goal of discovering ω, the true exponent of matrix multiplication, and at the same time, there exists a practical goal of leveraging fast algorithms to benefit the many applications that rely on dense matrix multiplication as a computational kernel.
The goal of this project is to connect the theoretical and practical research on the topic to discover and implement practical fast algorithms so that they can benefit applications. Strassen’s original algorithm has been demonstrated to decrease run times of matrix multiplications of reasonable sizes, even compared against heavily tuned library implementations of the classical algorithm. Recently, a wealth of new algorithms have been discovered, and there is renewed interest both in demonstrating that these algorithms are also practical and in using computer-aided search to discover new algorithms. While this problem is one of the most intensely studied in computer science, modern computational capabilities are opening up new possibilities for research breakthroughs.
The traditional metric for the efficiency of an algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or among parallel processors over a network.
The idea of communication-avoiding algorithms is to reformulate existing approaches to solving fundamental problems in order to significantly reduce the amount of data movement. Using abstract machine models, we strive to prove communication lower bounds, determining the minimum amount of data movement any algorithm must perform. Then we can compare the communication costs of existing approaches to the lower bounds to determine if algorithmic improvements are possible. If there exists a gap between best algorithm and best known lower bound, the goal becomes to close the gap, usually with an innovative algorithm. In this case, the ultimate goal is to implement the algorithm on state-of-the-art hardware to achieve practical performance improvements for today’s applications.