Software

CQSim: trace-based, event-driven scheduling simulator GitHub Link (an illustrative sketch of this simulation style follows the citations below)

  • If you use CQSim in your work, please cite the paper: X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, “Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems”, Proc. of SC’13, 2013.
  • The repo contains a branch called DRAS (Deep Reinforcement Learning Agent for HPC scheduling). If you use CQSim/DRAS in your work, please cite the paper: Y. Fan, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, “Deep Reinforcement Agent for Scheduling in HPC”, Proc. of IPDPS’21, 2021.
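
To give a sense of what "trace-based, event-driven" means here, the following is a minimal, self-contained sketch of such a simulation loop in Python. It is not CQSim's actual code; the trace format, machine size, and FCFS policy below are illustrative assumptions only.

    import heapq
    from collections import deque

    # Hypothetical workload trace: (submit_time, run_time, nodes_requested).
    trace = [(0, 100, 2), (10, 50, 4), (20, 30, 2)]

    free_nodes = 4                                    # size of the simulated machine
    events = [(job[0], "submit", job) for job in trace]
    heapq.heapify(events)                             # min-heap ordered by event time
    wait_queue = deque()                              # FCFS wait queue

    def dispatch(now):
        """Start queued jobs, in order, while enough nodes are free."""
        global free_nodes
        while wait_queue and wait_queue[0][2] <= free_nodes:
            job = wait_queue.popleft()
            free_nodes -= job[2]
            heapq.heappush(events, (now + job[1], "finish", job))
            print(f"t={now}: start  {job}")

    # Event loop: simulated time jumps from one event to the next instead of ticking.
    while events:
        now, kind, job = heapq.heappop(events)
        if kind == "submit":
            wait_queue.append(job)
        else:                                         # "finish": release the job's nodes
            free_nodes += job[2]
            print(f"t={now}: finish {job}")
        dispatch(now)

CQSim itself replays real job traces and implements far richer scheduling policies; the sketch only shows the event-driven skeleton in which simulated time advances from one submit or finish event to the next.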

Q-adaptive: multi-agent reinforcement learning based routing for Dragonfly networks GitHub Link

  • If you use Q-adaptive/SST in your work, please cite the paper: Y. Kang, X. Wang, and Z. Lan, “Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network”, Proc. of HPDC’21, 2021.

Union: in-situ workload manager for CODES simulation GitHub Link

  • If you use Union in your work, please cite the paper: X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, “Union: An Automatic Workload Manager for Accelerating Network Simulations”, Proc. of IPDPS’20, 2020.

CODES: flit-level, event-driven simulation toolkit for interconnection networks GitHub Link

  • X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan, “Trade-off Study of Localizing Communication and Balancing Network Traffic on Dragonfly System”, Proc. of IPDPS’18, 2018.
  • X. Yang, J. Jenkins, M. Mubarak, R. Ross, and Z. Lan, “Watch Out for the Bully! Job Interference Study on Dragonfly Network”, Proc. of SC’16, 2016.

DNPC: dynamic power capping library for HPC applications GitHub Link

  • If you use DNPC in your work, please cite the paper: S. Sharma, Z. Lan, X. Wu, and V. Taylor, “A Dynamic Power Capping Library for HPC Applications”, Proc. of IEEE Cluster’21, 2021.

MonEQ: application-level power profiling library on IBM Blue Gene/Q GitHub Link

  • If you use MonEQ in your work, please cite the paper: S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, “Profiling Benchmarks on IBM Blue Gene/Q”, Proc. of IEEE Cluster’13, 2013.

TopoMap: a suite of user-level libraries for effective topology-aware task mapping of MPI applications. GitHub Link

  • It supports InfiniBand-connected supercomputers, Cray XT5, and IBM Blue Gene/P systems.
  • J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, “Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers”, The Journal of Supercomputing, 2016.
  • J. Wu, X. Xiong, and Z. Lan, “Hierarchical Task Mapping for Parallel Applications on Supercomputers”, The Journal of Supercomputing, 71(5):1776-1802, 2015.

LibProfil: lightweight MPI profiling and tracing library for discovering the communication topology of MPI applications. GitHub Link

  • J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, “Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers”, The Journal of Supercomputing, 2016.

QSim: an event-driven job scheduling simulator for Cobalt. GitHub Link

  • W. Tang, N. Desai, D. Buettner, and Z. Lan, “Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P”, Proc. of IPDPS’10 [Best Paper Award], 2010.

Public Data at ALCF: ALCF Link

  • W. Allcock, P. Rich, Y. Fan, and Z. Lan, “Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne”, Proc. of the 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), held in conjunction with IPDPS, 2017.
  • W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu, “Reducing Fragmentation on Torus-Connected Supercomputers”, Proc. of IPDPS’11, pp. 828–839, 2011.

Others:

  • TOPPER: a system-level tradeoff modeling tool for quantitative analysis of performance, power, and resilience on extreme scale systems. It is built on CPN Tools using colored Petri nets. Check the paper: L. Yu, Z. Zhou, Y. Fan, M. E. Papka, and Z. Lan, “System-Wide Tradeoff Modeling of Performance, Power, and Resilience on Petascale Systems”, Journal of Supercomputing, 2018.
  • PuPPET: a power-performance modeling tool for predictive analysis of power management on extreme scale systems. It is built on CPN Tools using colored Petri nets. Check the paper: L. Yu, Z. Zhou, S. Wallace, M. E. Papka, and Z. Lan, “Quantitative Modeling of Power Performance Tradeoffs on Extreme Scale Systems”, Journal of Parallel and Distributed Computing, 2015.
  • SysDP: an automated fault diagnosis and prognosis software toolkit for large-scale systems. It has been tested with RAS (Reliability, Availability, and Serviceability) logs from Blue Gene systems.
  • FT-Pro: an application-level adaptive fault tolerance system for parallel applications. Here, “application-level” means the focus is on reducing application completion time in the presence of failures. It allows applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing (an illustrative sketch of this decision logic follows this list). It is implemented with the MPICH-V checkpointing package.
  • FARS: a Fault-Aware Runtime System for system-level adaptive fault tolerance. Here, “system-level” means the primary goal is to improve system productivity in the presence of failures. It not only includes runtime strategies to allocate spare nodes for failure avoidance, but also provides a general mechanism to select running jobs for rescheduling in case of resource contention. An event-driven simulator was developed to emulate computing systems using a batch scheduler enhanced with FARS. It has been tested with both synthetic data and machine traces collected from production systems.
  • FREM: a Fast REstart Mechanism to improve process recovery for general checkpoint/restart protocols. The core idea is to enable early process restart on a partial checkpoint image by tracking data access patterns after each checkpoint (see the sketch after this list). A prototype that implements FREM with the BLCR checkpointing tool has been developed; we have tested it with SPEC 2006. [FREM paper in IEEE Trans. on Computers]
  • ParaDLB and DistDLB: Dynamic load balancing methods for large-scale applications using the structured adaptive mesh refinement (SAMR) algorithm. The methods have been implemented and tested in the cosmological simulation code ENZO.
  • SWS (Seismic Wave Simulation): a seismic wave simulation package using the finite element method. It can be used not only for theoretical studies of seismic wave propagation, but also by engineers engaged in seismic data acquisition, processing, interpretation, and inversion. The tool can numerically solve any combination of the acoustic wave equation, the isotropic and anisotropic elastic wave equations, and the two-phase media wave equation. It was developed by Zhiling Lan and Xiumin Shao at the Chinese Academy of Sciences during 1993-1997. The software was purchased and used by China National Petroleum Corporation.
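
For FT-Pro above, the following is a minimal sketch of what an adaptive fault-tolerance decision at a checkpoint interval might look like. It is not the FT-Pro implementation; the cost model, parameters, and function name are assumptions made purely for illustration.

    def ft_pro_style_decision(failure_predicted: bool, failure_prob: float,
                              ckpt_overhead: float, migration_overhead: float,
                              expected_lost_work: float, spare_node_free: bool) -> str:
        """Pick the action expected to minimize application completion time."""
        # If a failure is anticipated on this node and a spare node is free,
        # migrate when moving is cheaper than the expected loss (assumed model).
        if (failure_predicted and spare_node_free
                and migration_overhead < failure_prob * expected_lost_work):
            return "PREVENTIVE_MIGRATION"
        # Otherwise checkpoint only when the expected loss outweighs its overhead.
        if failure_prob * expected_lost_work > ckpt_overhead:
            return "CHECKPOINT"
        return "SKIP"  # failure unlikely: skip to avoid checkpoint overhead

    # Example: high predicted risk and a free spare node -> migrate proactively.
    print(ft_pro_style_decision(True, 0.6, 30.0, 10.0, 120.0, True))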
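
For FREM above, the following toy sketch illustrates the partial-image restart idea: track which memory pages the process touches after a checkpoint, restore those eagerly on restart, and fetch the rest on demand. It is not the FREM/BLCR code; the class and page representation are hypothetical.

    class PartialRestart:
        """Toy model: restore eagerly only the pages touched since the checkpoint."""
        def __init__(self, checkpoint_image):
            self.image = checkpoint_image    # page id -> saved contents
            self.touched = set()             # pages accessed after the checkpoint

        def record_access(self, page_id):
            # In a real system this would come from page-protection traps.
            self.touched.add(page_id)

        def fast_restart(self):
            # Load only the tracked pages now; the rest can be fetched on demand.
            return {p: self.image[p] for p in sorted(self.touched) if p in self.image}

    # Example: a 4-page image where only pages 1 and 3 were touched post-checkpoint.
    pr = PartialRestart({0: b"a", 1: b"b", 2: b"c", 3: b"d"})
    pr.record_access(1)
    pr.record_access(3)
    print(pr.fast_restart())                 # -> {1: b'b', 3: b'd'}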