Community Online Office Hour

Improving Memory Utilization of Spark Jobs Using Alluxio

Get access to the on-demand video

Apache Spark has been widely adopted for in-memory data analytics at scale. However, efficient memory utilization is a common challenge, and users often either run out of memory or experience low and unstable performance. Many Spark users may not be aware of the differences in memory utilization between caching data directly in memory inside the Spark JVM and storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open up for discussion.

In this Office Hour we'll go over:

  • How to run Spark shell with Alluxio such that Spark jobs 
  • A demo to compare the memory usage between Spark cache and using Alluxio as the external off-heap caching service 
  • Open session for discussion on any topic, such as running Presto on Alluxio


Speaker: Bin Fan

Evangelist and Founding Member at Alluxio

Bin Fan is the founding engineer of Alluxio, Inc. and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked for Google, where he won the Technical Infrastructure Award. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, working on distributed systems.

Alluxio is...

...a data orchestration layer for compute in any cloud. It unifies data silos on-premises and across any cloud to give you data locality, accessibility, and elasticity.

Whether it’s accelerating big data frameworks on the public cloud, running big data workloads in hybrid cloud environments, or enabling big data on object stores or multiple clouds, Alluxio reduces the complexities associated with orchestrating data for today’s big data and AI/ML workloads.