
Accelerate the End-to-End Machine Learning Training Pipeline by Optimizing I/O


This article is the first in a series introducing the architecture and solution to accelerate machine learning model training.
The next article compares traditional solutions and explains how this new approach differs.

With artificial intelligence (AI) and machine learning (ML) becoming more pervasive and business-critical, organizations are advancing their AI/ML capabilities and broadening the use and scalability of AI/ML applications. These applications require data platforms to meet the following specific requirements:

1. High I/O efficiency. AI/ML pipelines consist not only of model training and inference but also of data loading and preprocessing steps, which precede training and have a major impact on overall speed. In this phase, AI/ML workloads often issue more frequent I/O requests against a larger number of smaller files than traditional data analytics applications. Better I/O efficiency can dramatically speed up the entire pipeline.

2. High throughput and low latency to saturate GPUs. Model training is compute-intensive and relies on GPUs to process data quickly and accurately. Because GPUs are expensive, optimal utilization is critical. In practice, however, I/O becomes the bottleneck: workloads are bound by how fast data can be delivered to the GPUs, not by how fast the GPUs can perform training calculations. Data platforms need high throughput and low latency to fully saturate GPU clusters and reduce cost.

3. Access to data across heterogeneous storage. As the volume of data keeps growing, it becomes more and more challenging for organizations to rely on a single storage system. A variety of storage options are in use across business units, including on-premises distributed storage systems (HDFS, Ceph) and cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage). Access to all the training data spanning these different environments is necessary to make models more effective.

4. Support for different deployment models. As data volumes grow, cloud storage has become a popular choice for its high scalability, reduced cost, and ease of use. Organizations want the flexibility and openness to train models on whatever cloud, hybrid, or multi-cloud infrastructure is available. The growing trend of separating compute resources from storage resources also necessitates remote storage systems. However, when storage is remote, data must be fetched over the network, which creates performance challenges. Data platforms need to achieve high performance while accessing data across these heterogeneous environments.

In summary, today's AI/ML workloads demand fast access to expansive amounts of data at a low cost in heterogeneous environments. Organizations need to modernize their data platforms so that these workloads can access data effectively, maintain high throughput, and keep the GPU resources used for model training fully utilized.

Alluxio is an open-source data orchestration platform for analytics and machine learning applications. Alluxio not only provides a distributed caching layer between training jobs and the underlying storage, but is also responsible for connecting to the under storage, fetching data proactively or on demand, caching data according to user-specified policies, and feeding data to the training frameworks.
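To make the small-file I/O pattern described above concrete, here is a minimal sketch in PyTorch. The dataset path, file sizes, and loader settings are illustrative assumptions, not code from the original article. Each sample read opens a separate small file, so against a remote store every sample pays a network round trip:

    # Minimal sketch of an I/O-bound input pipeline (paths are hypothetical).
    # Each __getitem__ opens and reads one small file, so every sample costs
    # a separate I/O request; against a remote store, that latency dominates.
    import os
    from torch.utils.data import Dataset, DataLoader

    class SmallFileDataset(Dataset):
        def __init__(self, root):
            # Imagine millions of small files, each only a few hundred KB.
            self.paths = [os.path.join(root, name) for name in os.listdir(root)]

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            # One open()/read() per sample: the access pattern that makes
            # AI/ML workloads far more request-heavy than batch analytics.
            with open(self.paths[idx], "rb") as f:
                return f.read()

    # Extra workers overlap I/O with compute, but they cannot hide the
    # latency of a slow remote store: if data arrives too slowly, the GPU
    # sits idle between batches.
    loader = DataLoader(SmallFileDataset("/data/train"),
                        batch_size=256, num_workers=8)

The point is not the code itself but the access pattern: the throughput of this loop, not GPU speed, sets the pace of the whole training job.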
The typical architecture is shown below. Using Alluxio to accelerate machine learning and deep learning training involves three steps: deploy Alluxio on the training cluster, mount Alluxio as a local folder to the training jobs, and read data from that local folder in the training scripts (see the sketch after this section). Here, we discuss the basic features and benefits of this architecture.

By caching data locally or close to the training jobs, Alluxio provides high I/O throughput, which prevents the unnecessarily low GPU utilization that comes from waiting for data to be fetched. Instead of duplicating the entire dataset onto every single machine, Alluxio implements a shared distributed caching service in which data can be evenly distributed across the cluster. This can greatly improve storage utilization, especially when the training dataset is much larger than the storage capacity of a single node, as shown in the following figure. In this figure, the whole dataset is stored in an object store.
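As a hedged sketch of how such a setup is typically wired together (the mount point, bucket name, and FUSE path below are placeholders, not the article's own configuration): Alluxio's `alluxio fs mount` command attaches an under store, here an S3 bucket, into the Alluxio namespace, and the Alluxio FUSE service exposes that namespace as a local folder that training scripts read like any other directory:

    # Sketch: a training job reading through an Alluxio FUSE mount.
    # Assumed one-time cluster setup with the Alluxio CLI (bucket and paths
    # are placeholders):
    #   alluxio fs mount /training-data s3://example-bucket/training-data
    # with the Alluxio FUSE service exposing the namespace at
    # /mnt/alluxio-fuse.
    import os

    DATA_DIR = "/mnt/alluxio-fuse/training-data"

    # The script sees ordinary local paths. Alluxio fetches each file from
    # S3 on first access and serves later reads from the shared cluster
    # cache, so the first epoch is bounded by S3 while subsequent epochs
    # hit the cache.
    for name in os.listdir(DATA_DIR):
        with open(os.path.join(DATA_DIR, name), "rb") as f:
            sample = f.read()  # cold read goes to S3; warm reads hit cache

Because the cache is shared and distributed, each node holds only a shard of the dataset rather than a full copy, which is what allows the cached working set to exceed the storage capacity of any single machine.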
