AIMCI

AI-Guided Resource Management for Advanced Cyberinfrastructure

Advanced cyberinfrastructure (CI) is undergoing disruptive changes in system architectures and application workloads. The landscape of cyberinfrastructure workloads is rapidly expanding beyond traditional computational simulations to include a hybrid mix of applications. CI facilities now host diverse high-performance systems with heterogeneous configurations, leading to a complex mix of computing, memory, and storage components. Existing CI management methods, which rely heavily on heuristics or manual intervention, struggle with these evolving challenges. This project addresses the complex challenges of CI resource management by integrating artificial intelligence (AI) technologies with human expertise. UIC is a federally designated Minority-Serving Institution (MSI), and the project's integrated education plan can strengthen diversity-focused programs at UIC, thus promoting greater diversity and inclusion within the scientific community.

The project transitions from managing isolated single clusters to coordinating facility-wide management, orchestrating the entire facility as a unified pool of diverse resources for a broad spectrum of applications with various resource requirements. Specifically, it aims to design and evaluate an AI-guided framework named AIMCI (Artificial Intelligence for Managing Cyberinfrastructure). Key research thrusts are: (1) developing new AI models for predictive analysis of resource usage patterns and user behavior, (2) applying reinforcement learning methods to optimize resource management in a complex and dynamic computing environment, and (3) building a discrete-event-driven simulator for exploratory simulation of CI resource management with human-in-the-loop interaction.

Faculty:

  • Zhiling Lan (PI)
  • Michael Papka (Co-PI)

Graduate Students:

  • Zhong Zheng (PhD)
  • Yash Kurkure (PhD)
  • Shambhawi Sharma (PhD)
  • Yihe Zhang (PhD)
  • Yiheng Tao (MS)

Collaborators:

  • Valerie Taylor (ANL)
  • Xingfu Wu (ANL)
  • Bill Allcock (ANL)
  • Paul Rich (ANL)

Key Publications

  • X. Wang, K. Brown, R. Ross, C. Carothers, and Z. Lan, “CQSim+: Symbiotic Simulation for Multi-Resource Scheduling in High-Performance Computing”, ACM Transactions on Modeling and Computer Simulation (TOMACS), under revision, 2025.
  • Y. Kurkure, S. Sharma, X. Wang, M. Papka, and Z. Lan, “CQSim+: Symbiotic Simulation for Multi-Resource Scheduling in High-Performance Computing”, ACM SIGSIM PADS, 2025.
  • X. Wang, Y. Kang, and Z. Lan, “Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System”, ACM Transactions on Modeling and Computer Simulation (TOMACS), 2024.
  • M. Dearing, Y. Tao, X. Wu, Z. Lan, and V. Taylor, “LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes”, International Workshop on Large Language Models and HPC (LLMxHPC), 2024.

Software and Data:

  • CQSim/CQSim+ – a discrete-event-driven scheduling simulator empowered by reinforcement learning.
  • Mantis – a unified performance and power profiling interface on heterogeneous systems. It not only provides a simple interface for automating complex profiling via many tools on different devices, but also offers a unified output data format for accelerating post-profiling data analysis.
  • These artifacts are available on the team’s GitHub page.

Acknowledgement:

This project is supported by the US National Science Foundation (OAC #2402901). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.