Team

SPEAR
Systems for Performance, Energy, and Resiliency
Overview
The SPEAR team conducts research across diverse areas of parallel and distributed systems, including cluster management, interconnection networking, performance modeling and simulation, power and energy efficiency, and fault tolerance. Our mission is to unify high-performance computing (HPC) and artificial intelligence (AI). We are dedicated to building HPC systems that accelerate AI applications (HPC4AI) and to leveraging advanced AI technologies to address key research challenges in HPC (AI4HPC).
Members
- Zhiling Lan (Professor)
- Mike Papka (Professor)
- Xin Wang (Postdoc)
- Shilpika (Postdoc at ANL)
- Yuping Fan (Postdoc at ANL)
- Melanie Cornelius (PhD)
- Matthew Dearing (PhD)
- Zhong Zheng (PhD)
- Greg Cross (PhD)
- Yash Kurkure (PhD)
- Shambhawi Sharma (PhD)
- Amy Byrnes (PhD)
- Yihe (Jordan) Zhang (PhD)
- Yiheng Tao (PhD)
- Maisy Dunlavy (PhD)
- Kanglin Xu (PhD)
- Can Bagirgan (PhD)
- Aldo Cabrera (PhD)
- Akshar Patel (PhD)
- Niccolo Brembilla (PhD)
- Giacomo Brunetta (PhD in 2026)
Collaborators
- Valerie Taylor (ANL)
- Xingfu Wu (ANL)
- Kevin Brown (ANL)
- Rob Ross (ANL)
- Tanwi Mallick (ANL)
- Chris Carothers (RPI)
- Andrew Norman (Fermilab)
- Kwan-Liu Ma (UC Davis)
- ALCF Operations and AI teams
- ExaDigit
Software
All software artifacts are available on the SPEAR GitHub. Several representative tools are listed below:
CQSim: trace-based, event-driven scheduling simulator
- If you use CQSim in your work, please cite the paper: X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, “Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems”, Proc. of SC’13, 2013.
- The repo contains a branch called DRAS (Deep Reinforcement Learning Agent for HPC scheduling). If you use CQSim/DRAS in your work, please cite the paper: Y. Fan, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, “Deep Reinforcement Agent for Scheduling in HPC”, Proc. of IPDPS’21, 2021.
Q-adaptive: multi-agent reinforcement learning based routing for Dragonfly networks
- If you use Q-adaptive/SST in your work, please cite the paper: Yao Kang, Xin Wang, and Zhiling Lan, “Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network”, Proc. of HPDC’21.
Union: in-situ workload manager for CODES simulation
- If you use Union in your work, please cite the paper: X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, “Union: An Automatic Workload Manager for Accelerating Network Simulations”, Proc. of IPDPS, 2020.
DNPC: dynamic power capping library for HPC applications
- If you use DNPC in your work, please cite the paper: Sahil Sharma, Zhiling Lan, Xingfu Wu, and Valerie Taylor, “A Dynamic Power Capping Library for HPC Applications”, IEEE Cluster 2021.
MonEQ: application-level power profiling library on IBM Blue Gene/Q
- If you use MonEQ in your work, please cite the paper: S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, “Profiling Benchmarks on IBM Blue Gene/Q”, Proc. of IEEE Cluster’13, 2013.
TopoMap: a suite of user-level libraries for effective topology-aware task mapping of MPI applications
- It supports InfiniBand-connected supercomputers, Cray XT5, and IBM Blue Gene/P systems. If you use the tool, please cite the paper: J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, “Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers”, The Journal of Supercomputing, 2016.
SPEAR in Action