AI4LCF

AI-guided leadership facility management

Leadership computing facilities (LCFs) across the country are facing significant changes in system architectures and workloads. Future machines are expected to comprise millions of processing elements, with heterogeneity spanning compute, memory, and storage. Meanwhile, emerging workloads include not only compute-intensive applications but also memory-intensive, data-intensive, and on-demand jobs. Current facility management approaches rely heavily on heuristics and tedious manual processing. They will no longer be able to keep up with the changes introduced by emerging workloads and the extremely heterogeneous system architectures necessary for exascale and, eventually, zettascale computing. Many facility management problems are, at their core, optimization problems.

AI/ML has proven useful for optimizing decision-making in complex systems, given a sufficiently large data set. Several independent tools are already deployed at Argonne to collect performance and operational data. Our prior work on reinforcement learning (RL) for intelligent job scheduling has shown that AI-guided workload management is feasible and promising. We believe the best outcomes will be achieved when human intelligence is combined with emerging AI technologies.

We propose AI-guided leadership facility management (AI4LCF), in which advanced AI technologies are used for automatic system monitoring and diagnosis, workload management, power and cooling distribution, policy control, and configuration optimization, continuously and in real time.

Faculty:

  • Zhiling Lan (PI)

Graduate Students:

  • Yuping Fan (PhD, graduated in 12/2021)
  • Boyang Li (PhD, graduated in 8/2023)
  • Zhong Zheng (PhD)
  • Melanie Cornelius (PhD)
  • Greg Cross (PhD)
  • Yash Kurkure (PhD)
  • Shambhawi Sharma (PhD)

Collaborators:

  • Mike Papka (ANL/UIC)
  • Bill Allcock (ANL)
  • Paul Rich (ANL)

Key Publications:

  • B. Li, Z. Lan, and M. Papka, “Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling”, MASCOTS, 2023.
  • B. Li, M. Dearing, Y. Fan, P. Rich, B. Allcock, M. Papka, and Z. Lan, “MRSch: Multi-Resource Scheduling for HPC”, IEEE Cluster, 2022.
  • B. Li, Y. Fan, M. Papka, and Z. Lan, “Encoding for Reinforcement Learning Driven Scheduling”, JSSPP, co-located with IPDPS, 2022.
  • Y. Fan, B. Li, D. Favorite, N. Singh, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, “DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing”, IEEE TPDS, 2022.
  • Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka, “Deep Reinforcement Agent for Scheduling in HPC”, IPDPS, 2021.
  • Y. Fan and Z. Lan, “DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling”, Software Impacts, 2021.

Acknowledgment:

  • This project is supported by a DOE/ANL subcontract. Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DOE/ANL.