In the new paper ALX: Large Scale Matrix Factorization on TPUs, a Google Research team introduces ALX, an open-source library written in JAX that leverages Tensor Processing Unit (TPU) hardware accelerators to enable efficient distributed matrix factorization using alternating least squares. The team also released WebGraph, a large-scale link prediction dataset designed to encourage further research into techniques for handling very large-scale sparse matrices.
Matrix factorization is an effective, foundational technique widely used in today's recommender systems. A successful large-scale implementation could dramatically accelerate progress in this growing field.
The proposed matrix factorization approach builds on several attractive properties of TPUs, which the team summarizes as follows:
- A TPU pod has enough distributed memory to store very large sharded embedding tables.
- TPUs are designed for workloads that benefit from data parallelism, which is useful for solving a large batch of systems of linear equations, a core operation in alternating least squares.
- TPU chips are connected directly to one another through dedicated, high-bandwidth, low-latency interconnects. This makes gather and scatter operations practical on a large distributed embedding table stored in TPU memory.
- Since any node failure can cause the training process to stop, traditional ML workloads require a highly reliable distributed setup, a requirement that a cluster of TPUs can fulfill.
Together, these properties make it possible to shard a large embedding table across all available devices while avoiding replication and fault tolerance issues.
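To make the "large batch of linear systems" point concrete, here is a minimal single-machine sketch of alternating least squares in plain NumPy. It is not ALX's implementation or API: the function names (`solve_side`, `objective`), the squared-error loss with L2 regularization, the toy problem sizes, and the dense 0/1 observation mask are all assumptions chosen for illustration. Each half-step solves one small regularized system per row, and all of those systems are handed to the solver as a single batch.

```python
import numpy as np

def solve_side(ratings, mask, fixed_emb, reg):
    """Update one side of the factorization with the other side held fixed.

    For every row u this solves (sum_i m_ui v_i v_i^T + reg*I) w_u = sum_i m_ui r_ui v_i,
    and all rows are solved together as one batched call.
    """
    dim = fixed_emb.shape[1]
    # One (dim x dim) Gram matrix per row, built only from observed entries.
    gram = np.einsum("ui,ij,ik->ujk", mask, fixed_emb, fixed_emb)
    gram = gram + reg * np.eye(dim)
    # One right-hand side vector per row.
    rhs = (mask * ratings) @ fixed_emb
    # Batched solve over shapes (rows, dim, dim) and (rows, dim, 1).
    return np.linalg.solve(gram, rhs[..., None])[..., 0]

def objective(ratings, mask, user_emb, item_emb, reg):
    """Regularized squared error over the observed entries."""
    err = mask * (ratings - user_emb @ item_emb.T)
    return (err ** 2).sum() + reg * ((user_emb ** 2).sum() + (item_emb ** 2).sum())

# Toy data (hypothetical sizes): a rank-4 matrix with about half its entries observed.
rng = np.random.default_rng(0)
num_users, num_items, dim, reg = 30, 20, 4, 0.1
ratings = rng.normal(size=(num_users, dim)) @ rng.normal(size=(dim, num_items))
mask = (rng.random((num_users, num_items)) < 0.5).astype(float)
user_emb = rng.normal(size=(num_users, dim))
item_emb = rng.normal(size=(num_items, dim))

start = objective(ratings, mask, user_emb, item_emb, reg)
for _ in range(5):
    # Alternate: fix items and solve for users, then fix users and solve for items.
    user_emb = solve_side(ratings, mask, item_emb, reg)
    item_emb = solve_side(ratings.T, mask.T, user_emb, reg)
end = objective(ratings, mask, user_emb, item_emb, reg)
```

Because the per-row systems are independent of one another, they can be split across devices without any coordination inside a half-step, which is the data-parallel structure the list above refers to.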
To make full use of the available TPU memory, the team presents a distributed matrix factorization algorithm that uses Alternating Least Squares (ALS) to learn the factorization parameters. The method shards both the user and item embedding tables uniformly across the TPU cores. In a pod configuration, multiple hosts (each connected to 8 TPU cores) feed batches of data to their attached TPU devices, so that the same computation runs in parallel on separate batches.
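The idea of spreading a table evenly across cores and then reading and writing rows by ID can be sketched on a single machine. The NumPy simulation below is only an illustration under stated assumptions: the helper names (`shard_table`, `sharded_gather`, `sharded_scatter`) are hypothetical, rows are assigned to cores in contiguous blocks, and the loop over cores stands in for work that real hardware would do in parallel with collective communication. The paper's actual layout and communication pattern may differ.

```python
import numpy as np

def shard_table(table, num_cores):
    """Split table rows into equal contiguous shards, zero-padding the last one."""
    rows, dim = table.shape
    rows_per_shard = -(-rows // num_cores)  # ceiling division
    padded = np.zeros((rows_per_shard * num_cores, dim), dtype=table.dtype)
    padded[:rows] = table
    return padded.reshape(num_cores, rows_per_shard, dim)

def sharded_gather(shards, ids):
    """Look up rows by global ID; each simulated core serves only the IDs it owns."""
    rows_per_shard = shards.shape[1]
    out = np.zeros((len(ids), shards.shape[2]), dtype=shards.dtype)
    for core in range(shards.shape[0]):
        owned = (ids // rows_per_shard) == core
        out[owned] = shards[core, ids[owned] % rows_per_shard]
    return out

def sharded_scatter(shards, ids, values):
    """Write rows back by global ID, routing each row to the core that owns it."""
    rows_per_shard = shards.shape[1]
    for core in range(shards.shape[0]):
        owned = (ids // rows_per_shard) == core
        shards[core, ids[owned] % rows_per_shard] = values[owned]

# Toy example: a 10-row table spread over 4 simulated cores (3 rows each, 2 rows of padding).
table = np.arange(30, dtype=float).reshape(10, 3)
shards = shard_table(table, num_cores=4)
batch_ids = np.array([0, 9, 4, 4])
batch_rows = sharded_gather(shards, batch_ids)
```

In this layout no row is stored twice, so memory use per core shrinks roughly in proportion to the number of cores, at the cost of communication whenever a batch needs rows held elsewhere.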
To perform large-scale evaluation experiments, the team created WebGraph, a large-scale link prediction dataset built from Common Crawl web data, along with several WebGraph variants that differ in the locality and sparsity of their subgraphs. These datasets will also be open-sourced.
The team analyzed how training time on the WebGraph variants scales as the number of available TPU cores increases. Empirical results show that with 256 TPU cores, an epoch of the largest WebGraph variant, WebGraph-sparse (a 365M x 365M sparse matrix), takes about 20 minutes to complete, indicating that ALX can easily scale to matrices up to 1B x 1B in size.
Overall, the study demonstrates the applicability of TPUs for accelerating large-scale matrix factorization. The Google team hopes their work will inspire further research and improvement on scalable methods and implementations of large-scale matrix factorization.
The paper ALX: Large Scale Matrix Factorization on TPUs is on arXiv.
Author: Hecate He | Editor: Michael Sarazen