Community Online Office Hour

Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

While adoption of the cloud and Kubernetes has made it exceptionally easy to scale compute, the increasing spread of data across different systems and clouds has created new challenges for data engineers. Accessing data in AWS S3 or on-premises HDFS efficiently becomes harder, and data locality is lost. How do you move data to compute workers efficiently? How do you unify data across multiple or remote clouds? The open source project Alluxio approaches this problem in a new way. It helps elastic compute workloads, such as Apache Spark, realize the true benefits of the cloud while bringing data locality and data accessibility to workloads orchestrated by Kubernetes.

One important performance optimization in Apache Spark is to schedule tasks on nodes where a local HDFS data node serves the task's input data. However, more users are running Apache Spark natively on Kubernetes, where HDFS is not an option. This office hour describes the concepts and dataflow behind running the Spark/Alluxio stack in Kubernetes with enhanced data locality, even when the storage service is external or remote.

In this Office Hour, we will go over:

  • Why Spark is able to make locality-aware scheduling decisions when working with Alluxio in a K8s environment using the host network;
  • Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using a domain socket and a host path volume;
  • The roadmap to improve this Spark/Alluxio stack in the context of K8s.
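
As a rough illustration of the first point, the sketch below assumes that locality comes down to matching the host names reported for a data block against the hosts that executors run on. It is a simplified toy, not Spark's scheduler or Alluxio's API; the function `pick_executor`, the executor ids, and the host names are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_executor(block_hosts: List[str], executors: Dict[str, str]) -> Optional[str]:
    """Return the id of an executor running on a host that holds the block.

    block_hosts: host names reported as holding the input block (hypothetical).
    executors:   mapping of executor id -> host name the executor runs on.
    Returns None when no executor shares a host with the block.
    """
    wanted = set(block_hosts)
    # Iterate in sorted order so the choice is deterministic.
    for executor_id in sorted(executors):
        if executors[executor_id] in wanted:
            return executor_id
    return None


# Hypothetical cluster: two executors on two nodes.
executors = {"exec-1": "node-a", "exec-2": "node-b"}

# The block location is reported under the node's own host name: a match is found.
local = pick_executor(["node-b"], executors)

# The block location is reported under an address no executor knows: no match.
remote = pick_executor(["10.1.2.3"], executors)
```

In this toy model, a local match is only possible when the cache worker and the executor identify the node by the same name, which is the role the host network plays in the first bullet.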

Interested in learning more? 

Get access to the on-demand video

Speaker: Bin Fan

Evangelist and Founding Member at Alluxio

Bin Fan is the founding engineer of Alluxio, Inc. and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked for Google, where he won the Technical Infrastructure Award. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, where he worked on distributed systems.

Speaker: Jiacheng Liu

Software Engineer at Alluxio

Jiacheng Liu works on the core dev team at Alluxio. He works mainly on Alluxio's integration with containerized environments. He has an M.S. from Columbia University.

Alluxio is...

...a data orchestration layer for compute in any cloud. It unifies data silos on-premises and across any cloud to give you data locality, accessibility, and elasticity.

Whether it’s accelerating big data frameworks on the public cloud, running big data workloads in hybrid cloud environments, or enabling big data on object stores or multiple clouds, Alluxio reduces the complexities associated with orchestrating data for today’s big data and AI/ML workloads.