SIAM Journal on Scientific Computing vol:37 issue:3 pages:C415-C438
The construction of the gradient of the objective function in gradient-based optimization algorithms for computing an r-term CANDECOMP/PARAFAC (CP) decomposition of an unstructured dense tensor is a key computational kernel. The best technique for efficiently implementing this operation has a memory consumption that scales linearly with the number of terms r and sublinearly with the number of elements of the tensor. We consider a blockwise computation of the CP gradient, reducing the memory requirements to a constant. This reduction is achieved by a novel technique that we call implicit block unfoldings, which combines the benefits of the block tensor unfoldings by [Ragnarsson and Van Loan, Block tensor unfoldings, SIAM J. Matrix Anal. Appl. 33(1), 2012] and the implicit unfoldings of [Phan, Tichavsky, and Cichocki, Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations, IEEE Trans. Signal Process., 61, 2013]. A heuristic algorithm for automatically choosing the division into subtensors is part of the proposed algorithm. The throughput that can be attained is essentially determined by the performance of a matrix product of two small matrices of constant size. Numerical experiments illustrate that the proposed method can outperform the current state-of-the-art by up to two orders of magnitude for large dense tensors in terms of memory consumption, while the increase of the execution time is no more than 5%. The proposed algorithm attained upward of 90% of the theoretical peak performance of the computer system, using no more than 50MB of memory, irrespective of the size of the tensor and the number of terms r.