MINT

Intelligent Management of Hybrid Workloads for Extreme Scale Computing

The high-performance computing (HPC) community is embracing artificial intelligence (AI) techniques for countless pursuits, from driving ground-breaking scientific discoveries to protecting our national security. As newly emerging machine learning and data-centric workloads proliferate in HPC, current workload-management systems cannot keep up with the significant challenges introduced by the diverse mix of applications co-running on heterogeneous systems. This project tackles the problem by developing an intelligent workload-management framework named MINT (Multi-resource INtelligenT management), in which the distinctive computational resource requirements of hybrid workloads will be automatically identified and fulfilled to achieve extreme resource efficiency and a satisfactory user experience. The project will develop fundamental improvements in HPC workload management to promote the use of large-scale supercomputers for emerging data-centric applications (HPC4AI). Meanwhile, it will exploit advanced AI technologies, especially multi-objective reinforcement learning, to empower job scheduling and resource allocation in HPC (AI4HPC). Key research thrusts include understanding the performance implications of diverse workloads on supercomputers via model-driven analysis, developing new intelligent multi-resource scheduling methods, designing smart resource-allocation strategies for minimal workload interference, and extensively evaluating the proposed framework through trace-based simulation and testing.

Faculty:

  • Zhiling Lan (PI)

Graduate Students:

  • Yuping Fan (PhD, graduated in 12/2021)
  • Boyang Li (PhD, graduated in 8/2023)
  • Matthew Dearing (PhD)
  • Zhong Zheng (PhD)
  • Melanie Cornelius (PhD)
  • Shambhawi Sharma (PhD)
  • Riccardo Strina (MS, graduated in 8/2024)
  • Alessandro Martinolli (MS)
  • Pietro Lodi Rizzini (MS)
  • Yiheng Tao (MS)

Collaborators:

  • Mike Papka (ANL/UIC)
  • Bill Allcock (ANL)
  • Paul Rich (ANL)

Key Publications:

  • M. Dearing, Y. Tao, X. Wu, Z. Lan, and V. Taylor, “LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes”, International Workshop on Large Language Models and HPC (LLMxHPC), 2024.
  • B. Li, Z. Lan, and M. Papka, “Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling”, MASCOTS, 2023.
  • Y. Kang, X. Wang, and Z. Lan, “Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly”, ACM SIGSIM-PADS’23, 2023.
  • Y. Kang, X. Wang, and Z. Lan, “Mitigating Network Contention with Intelligent Routing”, ACM/IEEE SC, 2022.
  • B. Li, M. Dearing, Y. Fan, P. Rich, B. Allcock, M. Papka, and Z. Lan, “MRSch: Multi-Resource Scheduling for HPC”, IEEE Cluster, 2022.
  • B. Li, Y. Fan, M. Papka, and Z. Lan, “Encoding for Reinforcement Learning Driven Scheduling”, JSSPP, co-located with IPDPS, 2022.
  • Y. Fan, B. Li, D. Favorite, N. Singh, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, “DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing”, IEEE TPDS (under revision), 2022.
  • Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka, “Deep Reinforcement Agent for Scheduling in HPC”, IPDPS, 2021. [PDF]
  • Y. Fan and Z. Lan, “DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling”, Software Impacts, 2021. [PDF]

Software and Data:

  • DRAS/CQSim – a discrete-event-driven scheduling simulator empowered by reinforcement learning.
  • CQGym – a common platform for studying various cluster scheduling policies under the same setting. In CQGym, a discrete-event-driven scheduling environment is integrated with a scheduling agent, such as a deep reinforcement learning agent, through the OpenAI Gym interface.
  • Mantis – a unified performance and power profiling interface on heterogeneous systems. It not only provides a simple interface for automating complex profiling via many tools on different devices, but also offers a unified output data format for accelerating post-profiling data analysis.
  • MRSch – an intelligent scheduling agent for multi-resource scheduling in HPC that leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm.
  • These artifacts are available on the team’s GitHub.
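
The agent-environment coupling described for CQGym above can be pictured with a toy example. The following is a minimal sketch in Python, not CQGym's actual code or API: the class `ToySchedulingEnv`, the `Job` fields, and both policies are hypothetical names introduced for illustration, and a simple shortest-job-first rule stands in for a learned agent. It assumes all jobs are submitted at time 0, and it only mimics the `reset`/`step` convention of the OpenAI Gym interface without importing it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Job:
    job_id: int
    nodes: int    # number of nodes requested
    runtime: int  # run time in simulated time units


class ToySchedulingEnv:
    """Hypothetical Gym-style scheduling environment (illustration only)."""

    def __init__(self, jobs: List[Job], total_nodes: int):
        if any(job.nodes > total_nodes for job in jobs):
            raise ValueError("a job requests more nodes than the system has")
        self.initial_jobs = list(jobs)
        self.total_nodes = total_nodes

    def reset(self) -> Dict:
        self.clock = 0
        self.queue = list(self.initial_jobs)
        self.running: List[Tuple[int, int]] = []  # (end_time, nodes)
        self.free_nodes = self.total_nodes
        return self._observe()

    def _observe(self) -> Dict:
        return {"clock": self.clock,
                "free_nodes": self.free_nodes,
                "queue": list(self.queue)}

    def step(self, action: int):
        """Start the queued job at index `action`, waiting for nodes if needed."""
        job = self.queue.pop(action)
        # Discrete-event advance: jump to the next completion until the job fits.
        while job.nodes > self.free_nodes:
            self.running.sort()
            end_time, nodes = self.running.pop(0)
            self.clock = max(self.clock, end_time)
            self.free_nodes += nodes
        self.free_nodes -= job.nodes
        self.running.append((self.clock + job.runtime, job.nodes))
        reward = -float(self.clock)  # negative wait time of the started job
        done = not self.queue
        return self._observe(), reward, done, {}


def shortest_job_first(obs: Dict) -> int:
    """Stand-in for a learned agent: pick the queued job with the shortest runtime."""
    queue = obs["queue"]
    return min(range(len(queue)), key=lambda i: queue[i].runtime)


def run_episode(env: ToySchedulingEnv, policy: Callable[[Dict], int]) -> float:
    """Run one episode and return the total reward (negative total wait time)."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total
```

Because every policy sees the same environment and the same job list, calling `run_episode` with different policies (for example `shortest_job_first` versus `lambda obs: 0` for first-come, first-served) compares them under identical conditions, which is the purpose the CQGym description states.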

Acknowledgement:

This project is supported by the US National Science Foundation (CCF 2413597, CCF 2109316). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.