Kronos
Hybrid Discrete Event Simulations
Parallel discrete event simulation (PDES) is a modeling methodology that is of key importance to the U.S. Department of Energy’s (DOE) science mission. PDES contributes to many fields, including science enterprise design and provisioning, transportation and mobility applications, national energy grid applications, internet and cybersecurity simulations, materials science applications, and simulations for hardware co-design. In the context of hardware co-design, despite significant advances in both extreme-scale computing systems and the PDES modeling frameworks that take advantage of these platforms, the simulation requirements and computational complexity of PDES hardware co-design models are growing at an intractable rate. Consequently, the timescales over which these hardware co-design models operate are limited to only a few seconds of simulated wall-clock time, putting long-timescale PDES simulations of future extreme-scale systems out of reach for current PDES frameworks.
HPC workflows and applications have natural patterns that recur at different timescales. This suggests that reduced-order analytic/machine learning (ML) surrogate models, if trained appropriately, could be used to make fast and accurate predictions of behavior, replacing potentially billions or even trillions of PDES events. Specifically, traditional HPC simulation codes often have a “solver” loop comprising phases of computation and communication. Iterations of this loop exhibit similarities that, in many cases, may be predicted without representing the details of resource utilization. When executed in concert on leadership computing platforms, multiple HPC applications can interfere with one another because of contention for shared resources such as network and storage devices. These periods of interference can potentially also be predicted accurately by surrogate models. The timescale over which these models may be accurate could be large in practice. We hypothesize that for most extreme-scale systems, only a few key simulation inflection points in a long-running simulation cannot be predicted by the proposed set of surrogate models. Once an accurate surrogate model has been established for a particular workload mix, we believe that the model can be used to predict system performance for a large fraction of the simulation, with high-fidelity PDES models used only for key inflection points.
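The idea of replacing repetitive solver-loop iterations with surrogate predictions can be illustrated with a small toy sketch. It is not the Kronos implementation: the function names (`detailed_iteration`, `run_hybrid`), the moving-average "surrogate", the stability test, and the use of an explicit phase label to mark an inflection point are all assumptions made for illustration. The sketch simulates each iteration event by event until the last few iteration times agree, then predicts later iterations of the same phase without processing any events.

```python
import heapq
from statistics import mean, pstdev


def detailed_iteration(compute_time, message_times):
    """Toy event-level model of one solver iteration (hypothetical).

    A compute phase is followed by messages serialized over one shared
    link. Returns (iteration_duration, events_processed).
    """
    events = [(compute_time, "compute_done")]
    t = compute_time
    for m in message_times:
        t += m
        events.append((t, "msg_done"))
    heapq.heapify(events)

    now, processed = 0.0, 0
    while events:  # process events in timestamp order
        now, _ = heapq.heappop(events)
        processed += 1
    return now, processed


def run_hybrid(workload, window=3, tolerance=0.05):
    """Run a workload, switching to a surrogate when iterations are stable.

    workload: list of (compute_time, message_times, phase_label).
    A change of phase_label stands in for an inflection point (assumption):
    the surrogate's history is discarded and detailed simulation resumes.
    Returns (total_simulated_time, events_processed, iterations_skipped).
    """
    history = []  # detailed iteration times observed in the current phase
    total_time, events_processed, skipped = 0.0, 0, 0
    prev_phase = None

    for compute_time, message_times, phase in workload:
        if phase != prev_phase:
            history.clear()
            prev_phase = phase

        recent = history[-window:]
        stable = (
            len(recent) >= window
            and pstdev(recent) <= tolerance * mean(recent)
        )
        if stable:
            # Surrogate: predict the iteration time, process no events.
            total_time += mean(recent)
            skipped += 1
        else:
            duration, n = detailed_iteration(compute_time, message_times)
            history.append(duration)
            total_time += duration
            events_processed += n

    return total_time, events_processed, skipped


if __name__ == "__main__":
    workload = [(1.0, [0.1, 0.1], "A")] * 10 + [(2.0, [0.2], "B")] * 10
    print(run_hybrid(workload))
```

In this toy setting the surrogate is only a windowed mean; a real surrogate would be a trained analytic or ML model, and detecting inflection points automatically (rather than from a given label) is one of the research questions the project itself raises.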
GOAL: The goal of the Kronos project is to create a surrogate-ready PDES framework and demonstrate the initial effectiveness of the surrogate modeling approach for improving the performance of hardware co-design simulations. We estimate that this approach could provide a 1000x improvement in execution time for hardware co-design simulations of interest to DOE. To realize the Kronos project vision, we have designed an aggressive research program for the two-year DOE Express program that includes the following major research objectives: (i) building a scalable workload module for hybrid simulations; (ii) creating machine-learning-driven and analytic surrogate models for PDES; (iii) enabling online transitions between PDES and surrogate models; and (iv) automatically transitioning between models. These research objectives will be evaluated using models of real HPC workloads.
IMPACT: Although the approach is conceptually promising, little prior work has investigated the potential for surrogate-model-based acceleration of hardware co-design PDES models. Our work takes important first steps that will shed light on the potential of this approach for understanding systems and workloads of interest to DOE.
Team Members:
- Kevin Brown (PI, ANL)
- Chris Carothers (co PI, RPI)
- Zhiling Lan (co PI, UIC)
- Xin Wang (PhD, UIC)
- Elkin Cruz (PhD student, RPI)
- Xiongxiao Xu (PhD student, IIT)
- Matthew Dearing (PhD student, UIC)
- Pietro Lodi Rizzini (MS student, UIC)