NSF Grant XXX
Optimizing Large-Scale Heterogeneous ML Platforms
Large-scale artificial intelligence and machine learning (AI/ML) platforms are playing a vital role in the current data revolution. To minimize user effort, an end-to-end solution is desired for deploying complex workflows over possibly heterogeneous clusters. However, the scheduling and resource management problems behind such “push-button” deployment are challenging. If left unsolved, these costly systems will be severely under-utilized, leading to unnecessary electricity consumption and greenhouse gas emissions. This project will develop efficient resource allocation policies for distributed, large-scale AI/ML systems to tackle these challenges.
Specifically, this project will accelerate and parallelize the large-scale optimization and inference tasks that dominate workloads in AI/ML platforms via distributed optimization that provides fault tolerance and robustness to stragglers in heterogeneous settings. Building on this distributed optimization, the project will further schedule AI/ML workflows with precedence constraints among sub-tasks. Finally, the project will allocate heterogeneous resources among jobs fairly and efficiently in the case where the resources being allocated are exchangeable, which is key for AI/ML platforms equipped with graphics processing units (GPUs) and other accelerators.
The project will provide new fundamental algorithms for scheduling and resource allocation in AI/ML platforms used across academia and industry. The algorithmic ideas will be developed in the context of core, classical models and so will apply more broadly than AI/ML platforms, e.g., to networking, storage, supply chain management, and beyond. The project will also seek to broaden the participation of underrepresented groups in STEM through planned activities, including the development of accelerated mathematics programs for middle school students, summer programs for middle school and high school students, and summer research programs for undergraduate students.
Publications & Output
To be added as the project progresses.