Publications

Ph.D. Dissertations:

  • Xin Wang, “Heterogeneous Workloads Study Towards Large-Scale Interconnect Network Simulation”, Illinois Institute of Technology, August 2023.
  • Boyang Li, “Efficient and Practical Cluster Scheduling for High Performance Computing” [Co-Advised with Michael Papka], Illinois Institute of Technology, July 2023.
  • Yao Kang, “Workload Interference Analysis and Mitigation on Dragonfly Class Networks”, Illinois Institute of Technology, Nov 2022.
  • Yuping Fan, “Intelligent Job Scheduling on High Performance Computing Systems” [Co-Advised with Michael Papka], Illinois Institute of Technology, Nov 2021.
  • Sean Wallace, “Power Profiling, Analysis, Learning, and Management for High-Performance Computing” [Co-Advised with Michael Papka], Illinois Institute of Technology, April 2017.
  • Xu Yang, “Cooperative Batch Scheduling for HPC Systems”, Illinois Institute of Technology, April 2017.
  • Eduardo Berrocal, “Improving Distributed Systems with Data Analysis”, Illinois Institute of Technology, April 2017.
  • Zhou Zhou, “Multi-Dimensional Batch Scheduling Framework for High-End Supercomputers”, Illinois Institute of Technology, December 2015.
  • Li Yu, “Reliability and Energy Analysis for Extreme Scale Systems”, Illinois Institute of Technology, December 2015.
  • Jingjin Wu, “Performance Analysis and Optimization of Large-Scale Scientific Applications”, Illinois Institute of Technology, July 2013.
  • Wei Tang, “An Integrated Resource Management and Scheduling Framework for Production Supercomputers”, Illinois Institute of Technology, July 2012.
  • Ziming Zheng, “Log Analysis for Reliability Management in Large-Scale Systems”, Illinois Institute of Technology, July 2012.
  • Yawei Li, “Adaptive Fault Management for High-Performance Computing”, Illinois Institute of Technology, December 2008.

Recent Publications:

  • Y. Kang, X. Wang, and Z. Lan, “Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly”, ACM SIGSIM-PADS’23, 2023.
  • X. Xu, X. Wang, E. Cruz, C. Carothers, K. Brown, R. Ross, Z. Lan and K. Shu, “Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation”, ACM SIGSIM-PADS’23, 2023.
  • E. Cruz, K. Brown, X. Wang, X. Xu, K. Shu, Z. Lan, R. Ross and C. Carothers, “Hybrid PDES Simulation of HPC Networks using Zombie Packets”, ACM SIGSIM-PADS’23, 2023.
  • Y. Fan, B. Li, D. Favorite, N. Singh, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, “DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing”, IEEE Transactions on Parallel and Distributed Systems (TPDS), 2022.
  • Y. Kang, X. Wang, and Z. Lan, “Mitigating Network Contention with Intelligent Routing”, Proc. of ACM/IEEE SC, 2022. PDF
  • B. Li, M. Dearing, B. Allcock, P. Rich, M. Papka, and Z. Lan, “MRSch: Multi-Resource Scheduling for HPC”, Proc. of IEEE Cluster, 2022. PDF
  • X. Wu, V. Taylor, and Z. Lan, “Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods”, Concurrency and Computation: Practice and Experience, 2022.
  • Y. Fan, Z. Lan, P. Rich, W. Allcock, and M. Papka, “Hybrid Workload Scheduling on HPC Systems”, Proc. of IPDPS, 2022. PDF
  • S. Sharma, Z. Lan, X. Wu, and V. Taylor, “A Dynamic Power Capping Library for HPC Applications”, IEEE Cluster (2-page extended research poster), 2021. PDF
  • Y. Kang, X. Wang, and Z. Lan, “Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network”, ACM HPDC, 2021. PDF
  • Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka, “Deep Reinforcement Agent for Scheduling in HPC”, IPDPS, 2021. PDF
  • Y. Fan and Z. Lan, “DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling”, Software Impacts, 2021.
  • X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, “Union: An Automatic Workload Manager for Accelerating Network Simulations”, Proc. of IPDPS, 2020. PDF
  • Y. Fan, Z. Lan, P. Rich, W. Allcock, M. Papka, B. Austin, and D. Paul, “Scheduling Beyond CPUs for HPC”, Proc. of HPDC’19, 2019. PDF
  • Y. Kang, X. Wang, N. McGlohon, M. Mubarak, S. Chunduri, and Z. Lan, “Modeling and Analysis of Application Interference on Dragonfly+”, Proc. of SIGSIM PADS’19, 2019. PDF
  • B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan, “The Effect of System Utilization on Application Performance Variability”, Proc. of ROSS’19, 2019. PDF
  • X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan, “Trade-off Study of Localizing Communication and Balancing Network Traffic on Dragonfly System”, Proc. of IPDPS’18, 2018. PDF
  • Y. Fan, P. Rich, W. Allcock, M. Papka, and Z. Lan, “Trade-off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates”, Proc. of IEEE Cluster’17 (acceptance rate is 21.8%), 2017. PDF
  • W. Allcock, P. Rich, Y. Fan, and Z. Lan, “Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne”, Proc. of the 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2017. PDF
  • J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, “Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers”, Journal of Supercomputing, 73(4), 2017. PDF
  • X. Yang, J. Jenkins, M. Mubarak, R. Ross, and Z. Lan, “Watch Out for the Bully! Job Interference Study on Dragonfly Network”, Proc. of SC16 (acceptance rate is 18%), 2016. PDF
  • S. Wallace, X. Yang, V. Vishwanath, W. Allcock, S. Coghlan, M. Papka, and Z. Lan, “A Data Driven Scheduling Approach for Power Management on HPC Systems”, Proc. of SC16 (acceptance rate is 18%), 2016. PDF
  • X. Zheng, Z. Zhou, X. Yang, Z. Lan, and J. Wang, “Exploring Plan-Based Scheduling for Large-Scale Computing Systems”, Proc. of IEEE Cluster’16 (acceptance rate is 24%), 2016. PDF
  • Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints”, IEEE Transactions on Parallel and Distributed Systems, 2016. PDF
  • Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, “I/O Aware Job Scheduling and Bandwidth Allocation for Petascale Computing Systems”, Journal of Parallel Computing (ParCo), 2016. PDF
  • S. Wallace, Z. Zhou, V. Vishwanath, S. Coghlan, J. Tramm, Z. Lan, and M.E. Papka, “Application Power Profiling on IBM Blue Gene/Q”, Journal of Parallel Computing (ParCo), 2016. PDF
  • E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, “Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications”, Proc. of Euro-Par, 2016. PDF
  • L. Yu, Z. Zhou, S. Wallace, M.E. Papka, and Z. Lan, “Quantitative Modeling of Power Performance Tradeoffs on Extreme Scale Systems”, Journal of Parallel and Distributed Computing, 2015. PDF
  • L. Yu and Z. Lan, “A Scalable, Non-Parametric Anomaly Detection Method for Large Scale Computing”, IEEE Transactions on Parallel and Distributed Systems, vol. 99(7), pp. 1902-1914, 2015. PDF
  • S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, “Comparison of Vendor Supplied Environmental Data Collection Mechanisms”, Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA), in conjunction with IEEE Cluster’15, 2015. PDF
  • Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, “I/O-Aware Batch Scheduling for Petascale Computing Systems”, Proc. of Cluster’15, 2015. PDF
  • E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, “Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications” (short paper), Proc. of HPDC’15, 2015. PDF
  • Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints”, Proc. of IPDPS’15, 2015. PDF
  • E. Berrocal, L. Yu, S. Wallace, M. Papka, and Z. Lan, “Exploring Void Search for Fault Detection on Extreme Scale Systems”, Proc. of IEEE Cluster’14 [Best Paper Award], 2014. PDF
  • X. Yang, X. Zheng, Z. Zhou, W. Tang, J. Wang, and Z. Lan, “Balancing Job Performance with System Performance via Locality-Aware Scheduling on Torus-Connected Systems”, Proc. of IEEE Cluster’14, 2014. PDF
  • J. Wu, X. Xiong, and Z. Lan, “Hierarchical Task Mapping for Parallel Applications on Supercomputers”, Journal of Supercomputing, vol. 71(5), pp. 1776-1802, 2015. PDF
  • Z. Zheng, L. Yu, and Z. Lan, “Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart”, IEEE Trans. on Computers, 2014. PDF
  • X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, “Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems”, Proc. of SC’13, 2013. PDF This work was selected as one of the SC13 Research Highlights by HPCWire! link
  • S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, “Profiling Benchmarks on IBM Blue Gene/Q”, Proc. of IEEE Cluster’13, 2013. PDF
  • L. Yu and Z. Lan, “A Scalable, Non-Parametric Anomaly Detection Framework for Hadoop”, Proc. of the ACM Cloud and Autonomic Computing Conference (CAC’13), 2013. PDF
  • Z. Zhou, Z. Lan, W. Tang, and N. Desai, “Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling”, Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2013. This work was selected as one of the top research items in the week of March 28, 2013 by HPCWire! link
  • W. Tang, D. Ren, Z. Lan, and N. Desai, “Toward Balanced and Sustainable Job Scheduling for High Performance Computing”, Parallel Computing (ParCo), 2013. PDF
  • W. Tang, N. Desai, D. Buettner, and Z. Lan, “Job Scheduling with Adjusted Runtime Estimates on Production Supercomputers”, Journal of Parallel and Distributed Computing (JPDC), 2013. PDF
  • S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, “Measuring Power Consumption on IBM Blue Gene/Q”, The 9th Workshop on High-Performance, Power-Aware Computing (HPPAC), 2013. PDF
  • Y. Yu, J. Wu, Z. Lan, D. Rudd, N. Gnedin, and A. Kravtsov, “A Transparent Collective I/O Implementation”, Proc. of IPDPS’13, 2013. PDF
  • J. Wu, Z. Lan, X. Xiong, N. Gnedin, and A. Kravtsov, “Hierarchical Task Mapping of Cell-based AMR Cosmology Simulations”, Proc. of SC’12, 2012. PDF
  • Z. Zheng, L. Yu, Z. Lan, and T. Jones, “3-Dimensional Root Cause Diagnosis via Co-Analysis”, Proc. of ICAC’12, 2012. PDF
  • L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile, “Filtering Log Data: Finding the Needles in the Haystack”, Proc. of DSN’12, 2012. PDF
  • Y. Yu, D. Rudd, Z. Lan, N. Gnedin, A. Kravtsov, and J. Wu, “Improving Parallel IO Performance of Cell-based AMR Cosmology Applications”, Proc. of IPDPS’12, 2012. PDF
  • W. Tang, N. Desai, V. Vishwanath, D. Buettner, and Z. Lan, “Multi-Domain Job Coscheduling for Leadership Computing Systems”, Journal of Supercomputing, 2011. PDF
  • J. Wu, R. Gonzalez, Z. Lan, N. Gnedin, A. Kravtsov, D. Rudd, and Y. Yu, “Performance Emulation of Cell-based AMR Cosmology Simulations”, Proc. of Cluster’11, 2011. PDF
  • L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, “Practical Online Failure Prediction for Blue Gene/P: Period-based vs Event-Driven”, Proc. of Proactive Failure Avoidance, Recovery, and Maintenance Workshop (PFARM), 2011. PDF
  • W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu, “Reducing Fragmentation on Torus-Connected Supercomputers”, Proc. of IPDPS’11, 2011. PDF
  • Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, “Co-Analysis of RAS Log and Job Log on Blue Gene/P”, Proc. of IPDPS’11, 2011. PDF
  • Y. Li and Z. Lan, “FREM: A Fast Restart Mechanism for General Checkpoint/Restart”, IEEE Trans. on Computers, 60(5), 2011. PDF
  • Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, “A Practical Failure Prediction with Location and Lead Time for Blue Gene/P”, Proc. of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), 2010. PDF
  • Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan, “A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems”, Journal of Parallel and Distributed Computing (JPDC), 2010. PDF
  • W. Tang, N. Desai, D. Buettner, and Z. Lan, “Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P”, Proc. of IPDPS’10 [Best Paper Award], 2010. PDF
  • Z. Lan, Z. Zheng, and Y. Li, “Toward Automated Anomaly Identification in Large-Scale Systems”, IEEE Trans. on Parallel and Distributed Systems, 21(2), pp. 174-187, 2010. PDF
  • Z. Zheng and Z. Lan, “Reliability-Aware Scalability Models for High Performance Computing”, Proc. of IEEE Cluster’09, 2009. PDF
  • W. Tang, Z. Lan, N. Desai, and D. Buettner, “Fault-Aware Utility-Based Job Scheduling on Blue Gene/P Systems”, Proc. of IEEE Cluster’09, 2009. PDF
  • Y. Li, Z. Lan, P. Gujrati, and X. Sun, “Fault-Aware Runtime Strategies for High Performance Computing”, IEEE Trans. on Parallel and Distributed Systems, vol. 20(4), pp. 460-473, 2009. PDF
  • Z. Zheng, Z. Lan, B-H. Park, and A. Geist, “System Log Pre-processing to Improve Failure Prediction”, Proc. of DSN’09, 2009. PDF
  • H. Jin, X. Sun, Z. Zheng, Z. Lan and B. Xie, “Performance under Failures of DAG-based Parallel Computing”, Proc. of CCGrid’09, 2009. PDF
  • J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B-H. Park, “Dynamic Meta-Learning for Failure Prediction in Large-scale Systems: A Case Study”, Proc. of ICPP’08, 2008. PDF
  • Y. Li and Z. Lan, “A Fast Recovery Mechanism for Checkpointing in Networked Environments”, Proc. of DSN’08, 2008. PDF
  • Z. Lan and Y. Li, “Adaptive Fault Management of Parallel Applications for High Performance Computing”, IEEE Trans. on Computers, vol. 57(12), pp. 1647-1660, 2008. PDF
  • Z. Zheng, Y. Li, and Z. Lan, “Anomaly Localization in Large-scale Clusters”, Proc. of IEEE Cluster’07, 2007. PDF
  • P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, “Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters”, Proc. of ICPP’07, 2007. PDF
  • Y. Li, P. Gujrati, Z. Lan, and X. Sun, “Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience”, Proc. of ICPP’07, 2007. PDF
  • Z. Lan, Y. Li, P. Gujrati, Z. Zheng, R. Thakur, and J. White, “A Fault Diagnosis and Prognosis Service for TeraGrid Clusters”, Proc. of TeraGrid’07, 2007. PDF
  • Y. Li and Z. Lan, “Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid”, Proc. of TeraGrid’07, 2007. PDF
  • Y. Li and Z. Lan, “Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing”, Proc. of CCGrid, 2006. PDF
  • Z. Lan, V. Taylor, and Y. Li, “DistDLB: Improving cosmology SAMR simulations on distributed computing systems through hierarchical load balancing”, Journal of Parallel and Distributed Computing (JPDC), Vol. 66(5), pp. 716-731, 2006.
  • J. Lee, Z. Lan, J. Amundson, and P. Spentzouris, “Evaluating Performance and Scalability of Advanced Accelerator Simulations”, Proc. of CCGrid, 2006. PDF
  • Y. Li and Z. Lan, “Proactive Fault Manager for High Performance Computing”, Proc. of The International Conference on Dependable Systems and Networks (Fast Abstract), 2005. PDF
  • Y. Li and Z. Lan, “A Novel Workload Migration Scheme for Heterogeneous Distributed Computing”, Proc. of CCGrid, 2005. PDF
  • Z. Lan and P. Deshikachar, “Performance Analysis of a Large-Scale Cosmology Application on Three Cluster Systems”, Proc. of IEEE Cluster 2003, 2003. PDF
  • Z. Lan, V. Taylor, and G. Bryan, “Exploring Cosmology Applications on Distributed Environments”, Journal of Future Generation Computer Systems, Vol. 19(6), pp. 839-847, August, 2003.
  • Z. Lan, V. Taylor, and G. Bryan, “A Novel Dynamic Load Balancing Scheme for Parallel Systems”, Journal of Parallel and Distributed Computing (JPDC), Vol. 62(12), pp. 1763-1781, 2002.
  • Z. Lan, V. Taylor, and G. Bryan, “Dynamic Load Balancing of SAMR Applications on Distributed Systems”, Proc. of SC01, 2001.
  • Z. Lan, V. Taylor, and G. Bryan, “Dynamic load balancing for structured adaptive mesh refinement applications”, Proc. of ICPP01, 2001.
  • V. Taylor, X. Wu, X. Li, J. Geisler, Z. Lan, M. Hereld, I. Judson and R. Stevens, “Prophesy: Automating the Modeling Process”, Third Annual International Workshop on Active Middleware Services (invited paper), 2001.
  • X. Wu, V. Taylor, X. Li, J. Geisler, Z. Lan, R. Stevens, M. Hereld, and I. Judson, “Design and Development of Prophesy Performance Database for Distributed Scientific Applications”, Proc. 10th SIAM Conference on Parallel Processing for Scientific Computing, March 2001.
  • V. Taylor, X. Wu, J. Geisler, X. Li, Z. Lan, R. Stevens, M. Hereld and I. Judson, “Prophesy: An Infrastructure for Analyzing and Modeling the Performance of Parallel and Distributed Applications”, Proc. HPDC, 2000.