Today, machine learning and data analytics applications are widely deployed on Kubernetes in the public cloud. However, several challenges remain to be addressed:
- I/O overhead of accessing data in disaggregated storage, which must be reduced to improve training performance and scale compute
- Cost overhead of the alternative, adding a high-speed distributed storage system
- Management overhead for data spread across multiple cloud object storage systems (e.g., S3, Azure, Google, Alibaba Cloud)
In this talk, we will share our experience implementing an effective data orchestration solution in Kubernetes for machine learning workloads with Alluxio in Alibaba Cloud. The solution greatly improves the performance of distributed deep learning training on 32 GPUs with ResNet50 and the ImageNet dataset. Benchmark results show a speedup of at least 3x at no extra cost.
We will talk about:
- Industry trends embracing the cloud for I/O hungry ML on GPUs
- An architecture to run TensorFlow/PyTorch jobs with Alluxio in Alibaba Cloud
- How to deploy, operate, and manage Alluxio in Kubernetes for ML workloads
- Our future work to make the solution transparent to users
Online Meetup | Speeding up Deep Learning in the Cloud with Alluxio + Kubernetes
Thursday, September 10
Yang Che is a Staff Engineer at Alibaba Cloud. He works on the Container Service for Kubernetes (ACK) team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform with cloud-native technology. He is the main author and core maintainer of GPU share Scheduler, and an active contributor to communities such as Kubernetes, Docker, and Kubeflow.
Sr. Architect, Alibaba Cloud
Speaker: Yang Che
Bin Fan is the founding engineer and VP of Open Source at Alluxio, Inc. Prior to Alluxio, he worked at Google, building its next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, working on the design and implementation of distributed systems.
Speaker: Bin Fan
Founding Engineer & VP of OS, Alluxio
...a data orchestration layer for compute in any cloud. It unifies data silos on-premise and across any cloud to give you data locality, accessibility, and elasticity.
Whether it’s accelerating big data frameworks on the public cloud, running big data workloads in hybrid cloud environments, or enabling big data on object stores or multiple clouds, Alluxio reduces the complexities associated with orchestrating data for today’s big data and AI/ML workloads.