COTA
a COoperative framework for Topology Awareness
As the number of compute nodes increases, so does the size of the interconnect network. Historically, floating-point computation was the most costly component of a system, but this is no longer the case. Systems today, and those anticipated in the future, are increasingly bound by their communication infrastructure and by the power dissipation associated with data movement across a rapidly growing number of nodes. Addressing the increasing cost of data movement on ever-growing systems has therefore become critical.
This project develops COTA, a COoperative framework for Topology Awareness. COTA is an integrated framework that coordinates the hardware, job scheduler, runtime, and application to jointly address the growing cost of data movement, with the goal of communication and power efficiency on large-scale systems. Most importantly, the framework supports topology awareness not only at job startup, but also during job execution. The newly developed mapping algorithms, topology-aware methods and tools, and topology-aware models provide a critical foundation for realizing topology awareness on current and future systems. This research has a direct impact on system productivity as well as on a broad range of application domains that use parallel systems for simulations. The project also enhances the curriculum at Illinois Tech, broadens participation by underrepresented groups, and reaches out to the surrounding communities.
A 1-page poster summary of COTA is available: poster’17.
Faculty:
- Zhiling Lan (PI, CS faculty)
- Jia Wang (co-PI, ECE faculty)
Graduate Students:
- Xu Yang (CS Ph.D. student, now at Amazon) (2013-2017)
- Xingwu Zheng (ECE Ph.D. student) (2013-2017)
- Xin Wang (CS Ph.D. student) (2014-2017)
- Manqi Zhang (CS Ph.D. student) (9/2016-12/2016)
- Peixin Qiao (CS Ph.D. student) (8/2016-9/2017)
- Zhou Zhou (CS Ph.D. student, now at Salesforce) (2012-2016)
- Yuping Fan (ECE Master, now CS Ph.D. student) (2014-2016)
- Ying Chen (CS Ph.D. student) (1/2016-7/2016)
- Eduardo Berrocal (CS Ph.D. student) (2014)
- Jianchao Yang (CS Master student, 10/2013-05/2014)
- Qi Zhan (CS Master student, 10/2013-05/2014)
REU Students:
- Arushi Rai (CS Undergraduate, 7/2017-9/2017) (REU Report)
- Blake Ehrenbek (CS Undergraduate, 7/2017-9/2017) (REU Report)
- Aleksandra Kukielko (CS undergraduate, 2016) (REU Report)
- Jia Hao He (CS undergraduate, 09/2015-12/2015)
- Tarun N. Gidwani (CS undergraduate, 2014)
- Asad Patel (CS undergraduate, 2014)
Collaborators:
- Jingjin Wu at Univ. of Elect. Science & Tech., China
- Xuangxing Xiong at Synopsys Inc.
- Paul Rich, Vitali Morozov, John Jenkins, Misbah Mubarak, and Rob Ross at Argonne National Laboratory
- Wei Tang and Narayan Desai at Google Inc.
Key Publications:
- X. Yang, J. Jenkins, M. Mubarak, R. Ross, and Z. Lan, “Watch Out for the Bully! Job Interference Study on Dragonfly Network”, Proc. of SC16 (acceptance rate is 18%), 2016. [PDF]
- X. Zheng, Z. Zhou, X. Yang, Z. Lan, and J. Wang, “Exploring Plan-Based Scheduling for Large-Scale Computing Systems”, Proc. of IEEE Cluster’16 (acceptance rate is 24%), 2016. [PDF]
- Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints”, IEEE Transactions on Parallel and Distributed Systems, 2016.
- Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, “I/O Aware Job Scheduling and Bandwidth Allocation for Petascale Computing Systems”, Journal of Parallel Computing (ParCo), 2016. [PDF]
- J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, “Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers”, The Journal of Supercomputing, 2016.
- J. Wu, X. Xiong, and Z. Lan, “Hierarchical Task Mapping for Parallel Applications on Supercomputers”, The Journal of Supercomputing, 71(5):1776-1802, 2015.
- X. Yang, J. Jenkins, M. Mubarak, X. Wang, R. Ross, and Z. Lan, “Study of Intra- and Inter-Job Interference on Torus Networks”, Proc. of ICPADS (The 22nd IEEE Intl Conf. on Parallel and Distributed Systems), 2016.
- Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, “I/O-aware Batch Scheduling for Petascale Computing Systems”, Proc. of Cluster’15, 2015.
- Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints”, Proc. of IPDPS’15, 2015.
- X. Yang, X. Zheng, Z. Zhou, W. Tang, J. Wang, and Z. Lan, “Balancing Job Performance with System Performance via Locality-Aware Scheduling on Torus-Connected Systems”, Proc. of IEEE Cluster’14, 2014. [PDF]
- X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, “Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems”, Proc. of SC’13, 2013. [PDF]
Ph.D. Dissertations:
- Xingwu Zheng, “Advanced Algorithms For HPC Job Scheduling” [Advisor: Jia Wang, co-advisor: Zhiling Lan], Department of Electrical and Computer Engineering, Illinois Institute of Technology, November 2017.
- Xu Yang, “Cooperative Batch Scheduling for HPC Systems” [Advisor: Zhiling Lan], Department of Computer Science, Illinois Institute of Technology, April 2017.
- Zhou Zhou, “Multi-Dimensional Batch scheduling Framework for High-End Supercomputers” [Advisor: Zhiling Lan], Department of Computer Science, Illinois Institute of Technology, December 2015.
Software Tools:
- (Software) CQSim - a discrete event driven scheduling simulator. Link
- (Software) LibProfil - a light-weight user-transparent communication profiler. Link
- (Software) TOPOMap - two topology aware task mapping libraries. Link
Acknowledgement:
This project is supported by the US National Science Foundation (CNS-1320125). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.