SPEAR

Systems for Performance, Energy, and Resiliency

Overview

The SPEAR team conducts research across diverse areas of parallel and distributed systems, including cluster management, interconnection networking, performance modeling and simulation, power and energy efficiency, and fault tolerance. Our mission is to unify high-performance computing (HPC) and artificial intelligence (AI). We are dedicated to building HPC systems that accelerate AI applications (HPC4AI) and to leveraging advanced AI technologies to address key research challenges in HPC (AI4HPC).

Members

Zhiling Lan (Professor)
Mike Papka (Professor)
Xin Wang (Postdoc)
Shilpika (Postdoc at ANL)
Yuping Fan (Postdoc at ANL)
Melanie Cornelius (PhD)
Matthew Dearing (PhD)
Zhong Zheng (PhD)
Greg Cross (PhD)
Yash Kurkure (PhD)
Chris Grams (PhD)
Amy Byrnes (PhD)
Yihe (Jordan) Zhang (PhD)
Yiheng Tao (PhD)
Maisy Dunlavy (PhD)
Kanglin Xu (PhD)
Can Bagirgan (PhD)
Aldo Cabrera (PhD)
Akshar Patel (PhD)
Niccolo Brembilla (PhD)
Giacomo Brunetta (MS/PhD)

Collaborators

We work closely with several research teams at Argonne National Laboratory (ANL), including the ALCF Operations team and the performance team led by Valerie Taylor.

Software

We actively develop and maintain open-source software on the SPEAR GitHub. Several representative tools are listed below:

CQSim: an event-driven scheduling simulator designed for rapid what-if exploration of scheduling scenarios.
Q-adaptive: a routing design for Dragonfly networks driven by multi-agent reinforcement learning.
MFNetSim: a hybrid network simulation framework for joint MPI and I/O modeling on Dragonfly systems.
MAGUS: a system-level library for adaptive uncore scaling that minimizes power waste on heterogeneous CPU-GPU systems.
DNPC: a user-level dynamic power capping library for parallel applications.
