Tachyon
Intelligent Multi-Scale Modeling of Distributed Resilient Infrastructure and Workflows for Data Intensive HEP Analyses
The success of DOE ``big science'' is increasingly tied to data analysis on extreme-scale distributed computing infrastructure. The DOE High Energy Physics (HEP) program in Neutrino and Collider science has been a key driver in applying and adapting data-intensive science and simulation codes to extreme-scale platforms. These complex, highly distributed workflows will continue to push the limits of current and future extreme-scale systems, especially as they evolve to utilize increasingly sophisticated AI/ML techniques for their data analysis.

For example, the next generation of neutrino oscillation experiments, led by the DUNE experiment, is based on liquid argon time projection chamber technology. These detectors generate petabytes of high-resolution image data that capture high-energy, complex interactions of neutrinos on argon nuclei and allow for the measurement of fundamental parameters in the neutrino mixing model. The DUNE far detectors (located 1,000 miles from FNAL) are capable of producing 6 GB of readout data per 5 milliseconds, providing a resolution, fidelity, and data volume nearly 300 times greater than the equivalent interactions captured with current-generation technologies. Components of this high-fidelity experimental data stream are intended to be analyzed in near real time, requiring leadership-class computing resources to do so. The hierarchy of complex high-performance computing, network, and storage components, as well as the pathways between the experimental and leadership computing facilities, must be modeled, analyzed, and tuned to meet the necessary response times and resiliency to disruption under nominal and atypical operational conditions.
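A back-of-envelope calculation from the figures quoted above conveys the scale of the implied readout bandwidth (assuming, for illustration, that the burst rate were sustained continuously):

\[
\frac{6\ \mathrm{GB}}{5\ \mathrm{ms}} = 1.2\ \mathrm{TB/s} \approx 4.3\ \mathrm{PB/hour}.
\]

Even at a modest duty cycle, rates of this order motivate the end-to-end modeling of the compute, network, and storage pathways described above.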
To address this modeling grand challenge and build on past successful collaborations, the five-institution Tachyon Project team proposes a framework that enables the scalable modeling, simulation, and validation of key performance characteristics for the Fermilab (FNAL) to Argonne Leadership Computing Facility (ALCF) distributed infrastructure and associated HEP workflows. Our proposed framework is a multi-scale HEP workflow simulation model that will accurately model and predict end-to-end workflow performance over the wide range of timescales and job/system failure scenarios under which a resilient distributed HEP infrastructure must operate, now and in the future.
The proposed research program is divided into Core Research Tasks (CRTs) and leverages design outcomes from the DOE EXPRESS Kronos Project, as well as the CODES systems modeling framework, to integrate complementary modeling methodologies: parallel discrete-event simulation (PDES), surrogate machine learning (ML) models, and analytic models, into an overall scalable system model (CRT T3). This scalable system model is coupled with extensive facility-supported performance data (CRT T1), resilient job scheduling (CRT T2), and highly informative visualization and performance analysis (CRT T4).
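To make the multi-scale integration concrete, the following is a minimal sketch of how models of different fidelity might sit behind a common interface and be composed into an end-to-end estimate. All class and function names here are hypothetical illustrations, not the actual CODES or Tachyon APIs.

```python
# Minimal sketch: composing modeling methodologies of different fidelity
# behind one interface. All names are hypothetical, not the CODES/Tachyon API.
from abc import ABC, abstractmethod

class ComponentModel(ABC):
    """Common interface for analytic, surrogate-ML, and PDES-backed models."""
    @abstractmethod
    def predict_latency(self, job: dict) -> float: ...

class AnalyticLinkModel(ComponentModel):
    """Closed-form network model: cheap and coarse (latency + bytes/bandwidth)."""
    def __init__(self, bandwidth_gbps: float, base_latency_s: float):
        self.bw, self.lat = bandwidth_gbps, base_latency_s
    def predict_latency(self, job: dict) -> float:
        return self.lat + job["bytes"] / (self.bw * 1e9 / 8)

class SurrogateModel(ComponentModel):
    """Wraps an ML regressor trained offline on simulation + facility data."""
    def __init__(self, regressor, featurize):
        self.regressor, self.featurize = regressor, featurize
    def predict_latency(self, job: dict) -> float:
        return float(self.regressor.predict([self.featurize(job)])[0])

def end_to_end_estimate(stages: list, job: dict) -> float:
    """Compose per-stage estimates into an end-to-end workflow time."""
    return sum(stage.predict_latency(job) for stage in stages)

# Example: one 6 GB readout block over a hypothetical 100 Gbps link.
stages = [AnalyticLinkModel(bandwidth_gbps=100, base_latency_s=0.02)]
print(end_to_end_estimate(stages, {"bytes": 6e9}))  # ~0.5 s
```

The design choice this illustrates is that fast analytic or surrogate models can stand in for expensive PDES components wherever coarse estimates suffice, reserving high-fidelity simulation for the parts of the system that need it.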
The Tachyon Project will model the entire HEP distributed infrastructure and workflow campaign by creating surrogate ML models that are trained using both historical facility data and massively parallel CODES-generated simulation data, the latter validated through extensive interactive visualization and analysis of the data and models via facility performance data repositories. By leveraging validated, high-fidelity CODES simulation data, we can dramatically increase the predictive range of surrogate ML models beyond the preset job and system configurations contained within current historical performance data repositories.
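As an illustration of this idea (not the project's actual training pipeline), the sketch below trains a surrogate regressor on a mix of ``historical'' samples restricted to small configurations and ``simulated'' samples covering a much larger configuration space. The synthetic data generator, feature names, and gradient-boosting choice are all assumptions for demonstration.

```python
# Illustrative sketch only: a surrogate trained on a mix of historical-style
# and simulation-style samples. The data generator stands in for real
# facility and CODES repositories; features and model choice are assumed.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_samples(n, max_nodes):
    """Stand-in for either historical logs or CODES simulation output."""
    nodes = rng.integers(1, max_nodes, n)
    input_gb = rng.uniform(1, 500, n)
    link_load = rng.uniform(0, 1, n)
    runtime = input_gb / nodes * (1 + 2 * link_load) + rng.normal(0, 1, n)
    return np.column_stack([nodes, input_gb, link_load]), runtime

# Historical data covers only small configurations; simulation data
# extends the surrogate's predictive range to much larger ones.
X_hist, y_hist = make_samples(2000, max_nodes=64)
X_sim, y_sim = make_samples(2000, max_nodes=4096)
X = np.vstack([X_hist, X_sim])
y = np.concatenate([y_hist, y_sim])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
surrogate = GradientBoostingRegressor().fit(X_tr, y_tr)
print(f"held-out R^2: {surrogate.score(X_te, y_te):.3f}")
```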
To capture the importance of resiliency, job placement, and scheduling in our scalable system modeling framework, the CQSim scheduling simulator will be extended and integrated with CODES to enable workflow-centric/facility-centric scheduling and reliability modeling. To address the need to understand phenomena in both the real distributed infrastructure and the scalable system model, visual analytic methods will be developed for the facility performance data, systems simulators, and surrogate models.
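The sketch below gives a flavor of failure-aware discrete-event scheduling simulation. It is a toy model, not the CQSim or CODES API; the FCFS policy, exponential failure process, and restart-from-scratch semantics are all simplifying assumptions.

```python
# Toy discrete-event sketch of failure-aware FCFS scheduling. Not the
# CQSim/CODES API; the policy and failure model are simplifications.
import heapq
import random

def simulate(jobs, n_nodes, mtbf_s, seed=0):
    """jobs: list of (arrival_s, duration_s) single-node jobs.
    A node failure kills one running job, which restarts from scratch.
    Returns the mean turnaround time (finish - arrival)."""
    rng = random.Random(seed)
    events, seq = [], 0  # heap of (time, seq, kind, payload); seq breaks ties

    def push(t, kind, payload=None):
        nonlocal seq
        heapq.heappush(events, (t, seq, kind, payload))
        seq += 1

    for arr, dur in jobs:
        push(arr, "arrive", (arr, dur))
    push(rng.expovariate(n_nodes / mtbf_s), "fail")  # aggregate failure process

    free, queue, running, next_id, turnarounds = n_nodes, [], {}, 0, []

    def dispatch(now):
        nonlocal free, next_id
        while free and queue:
            arr, dur = queue.pop(0)                # FCFS
            running[next_id] = (arr, dur)
            push(now + dur, "finish", next_id)
            free -= 1
            next_id += 1

    while events:
        now, _, kind, payload = heapq.heappop(events)
        if kind == "arrive":
            queue.append(payload)
            dispatch(now)
        elif kind == "finish" and payload in running:  # stale finishes ignored
            arr, _ = running.pop(payload)
            turnarounds.append(now - arr)
            free += 1
            dispatch(now)
        elif kind == "fail":
            if running:                            # kill one job, requeue it
                victim = rng.choice(sorted(running))
                queue.append(running.pop(victim))
                free += 1                          # node assumed to recover
                dispatch(now)
            if any(e[2] != "fail" for e in events):  # more work remains
                push(now + rng.expovariate(n_nodes / mtbf_s), "fail")

    return sum(turnarounds) / len(turnarounds)

print(simulate([(0, 100), (10, 50), (20, 200)], n_nodes=2, mtbf_s=300))
```

Even this toy model exhibits the behavior the real framework must capture at scale: as the mean time between failures shrinks relative to job durations, restarts inflate turnaround times in a strongly nonlinear way.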
Impact: Our proposed scalable system model will have a target accuracy of 90\% and will perform significantly faster than existing high-fidelity modeling approaches, yielding a highly valuable ``what-if'' planning tool for future distributed HEP experiments. More broadly, scalable modeling of distributed HEP systems and exploration of their stability will allow us to design workflow topologies and operational envelopes that match the mission demands of other distributed science domains. In turn, the Tachyon Project scalable system modeling framework will maximize the impact of the DOE's investment in distributed infrastructures like FNAL and ALCF by improving their resiliency and increasing the rate of scientific discoveries across the breadth of the DOE's experimental and computing resources.
PI Team
- Chris Carothers (RPI)
- Rob Ross and Kevin Brown (ANL)
- Zhiling Lan (UIC)
- Andrew Norman (FNAL)
- Kwan-Liu Ma (UC Davis)