GREY BALLARD

MPI-ATTAC: Algorithms for Tensor Train Arithmetic and Computations

The software provides parallel algorithms for performing arithmetic on tensors in tensor train (TT) format. The arithmetic operations include addition, elementwise multiplication, and norm computation, which are performed in way that exploits and maintains the TT structure of the tensors. The truncation of TT ranks is computed via a rounding routine that produces an approximation satisfying a specified error tolerance. It is written in C and MPI and relies on BLAS and LAPACK.

PLANC: Distributed NMF Library

PLANC solves the nonnegative matrix factorization problem: approximating a matrix as a product of two matrices whose entries are nonnegative. The code is written in C++ and uses OpenMP for shared-memory parallelism and MPI for distributed-memory parallelism. It incorporates multiple alternating-updating techniques, including multiplicative updates, hierarchical alternating least squares, and block principal pivoting. It uses Armadillo to interface with BLAS and LAPACK libraries.

TuckerMPI: Parallel Tucker Tensor Decomposition

TuckerMPI computes the Tucker decomposition of dense tensors using the Sequentially Truncated Higher-Order Singular Value Decomposition algorithm. It is designed for use in distributed memory but can also be used on a single node. The code is written in C++ with MPI. It also relies on BLAS and LAPACK libraries.

MADS: Maximum All-Pairs Dot Product Search

This software contains Matlab and C implementations for computing only the largest entries of a matrix-matrix product. The C implementations rely on a mex interface to MATLAB and use subroutines from the CSparse library. The main algorithm is randomized, based on wedge or diamond sampling, but it can also compute the largest entries deterministically (with reduced memory footprint).

Fast-Matmul: Practical Fast Matrix Multiplication

This software contains implementations of fast matrix multiplication algorithms (those that require fewer than the classical O(n^3) arithmetic operations) for sequential and shared-memory parallel environments. The code is written in C++ and can incorporate any fast algorithm that can be expressed by three matrices U, V, and W. It relies on BLAS's DGEMM for the implementation of the classical algorithm.

CA-SBR: Communication-Avoiding Successive Band Reduction

This code is used to help solve the symmetric eigenvalue problem by using similarity transformations to reduce a symmetric band matrix to symmetric tridiagonal form (maintaining the eigenvalues). It is written in C and uses OpenMP for shared-memory parallelization. It also includes MATLAB code for visualization of the algorithm.

Other Software

I've also made minor contributions to the following libraries: