Optimizing Resource Management in Cloud Analytics Services

Citation

Ren, Xiaoqi (2018) Optimizing Resource Management in Cloud Analytics Services. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/K62Y-FV39. https://resolver.caltech.edu/CaltechTHESIS:05312018-080301508

Abstract

The fundamental challenge in the cloud today is how to build and optimize machine learning and data analytical services. Machine learning and data analytical platforms are changing computing infrastructure from expensive private data centers to easily accessible online services. These services pack user requests as jobs and run them on thousands of machines in parallel in geo-distributed clusters. The scale and the complexity of emerging jobs lead to increasing challenges for the clusters at all levels, from power infrastructure to system architecture and corresponding software framework design.

These challenges come in many forms. Today's clusters are built on commodity hardware, and hardware failures are unavoidable. Resource competition, network congestion, and mixed generations of hardware make the hardware environment complex and hard to model and predict. Such heterogeneity becomes a crucial roadblock for efficient parallelization at both the task level and the job level. Another challenge comes from the increasing complexity of the applications. For example, machine learning services run jobs made up of multiple tasks with complex dependency structures. This complexity leads to difficulties in framework design. The scale, especially when services span geo-distributed clusters, is another important hurdle for cluster design. Challenges also come from the power infrastructure, which is very expensive and accounts for more than 20% of the total cost of building a cluster. Optimizing power sharing to maximize facility utilization and smooth peak-hour usage is another roadblock for cluster design.

In this thesis, we focus on solutions for these challenges at the task level, at the job level, in geo-distributed data cloud design, and in power management for colocation data centers.

At the task level, a crucial hurdle to achieving predictable performance is stragglers, i.e., tasks that take significantly longer than expected to run. Speculative execution has been widely adopted to mitigate the impact of stragglers in simple workloads. We apply straggler mitigation to approximation jobs for the first time. We present GRASS, which carefully uses speculation to mitigate the impact of stragglers in approximation jobs. GRASS's design is based on the analysis of a model we develop to capture the optimal speculation levels for approximation jobs. Evaluations with production workloads from Facebook and Microsoft Bing in an EC2 cluster of 200 nodes show that GRASS increases the accuracy of deadline-bound jobs by 47% and speeds up error-bound jobs by 38%.
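As a rough illustration of why speculative execution helps with stragglers, the following toy sketch computes a job's completion time with and without a speculative copy per task. It is not GRASS's algorithm: the function names, the fixed launch threshold `launch_at`, and the sample durations are all hypothetical and chosen only for illustration.

```python
def task_completion(original, copy, launch_at):
    """Finish time of one task under a simple speculation rule (illustrative only).

    A speculative copy is launched at time `launch_at` only if the original
    attempt is still running then; the task finishes when either attempt does.
    """
    if original <= launch_at:
        # The original finished before the threshold, so no copy is launched.
        return original
    return min(original, launch_at + copy)


def job_completion(originals, copies, launch_at):
    """A job finishes when its slowest task finishes."""
    return max(task_completion(o, c, launch_at) for o, c in zip(originals, copies))


# Hypothetical task durations; the third task is a straggler.
originals = [2.0, 3.0, 10.0]
# Hypothetical durations a speculative copy of each task would take.
copies = [2.0, 2.5, 2.0]

without_speculation = max(originals)
with_speculation = job_completion(originals, copies, launch_at=4.0)
```

In this toy setting the straggler's copy starts at time 4.0 and finishes before the original would, so the job completes earlier. The sketch ignores the cost of the extra slot the copy occupies, which is the tradeoff a real speculation policy has to weigh.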

Moving from the task level to the job level, we observe that task-level speculation mechanisms are designed and operated independently of job scheduling, even though scheduling a speculative copy of a task has a direct impact on the resources available for other jobs. Thus, we present Hopper, a job-level speculation-aware scheduler that integrates the tradeoffs associated with speculation into job scheduling decisions, based on a model generalized from the task-level speculation model. We implement both centralized and decentralized prototypes of the Hopper scheduler and show that coordinating scheduling and speculation yields improvements of 50% (66%) over state-of-the-art centralized (decentralized) schedulers and speculation strategies.

As computing resources move from local clusters to geo-distributed cloud services, we expect the same transformation for data storage. We study two crucial pieces of a geo-distributed data cloud system: data acquisition and data placement. Starting from an optimal algorithm for a data cloud made up of a single data center, we propose a near-optimal, polynomial-time algorithm for a geo-distributed data cloud in general. We show, via a case study, that the resulting design, Datum, is near-optimal (within 1.6%) in practical settings.

Efficient power management is a fundamental challenge for data centers when providing reliable services. Power oversubscription in data centers is very common and may occasionally trigger an emergency when the aggregate power demand exceeds the capacity. We study power capping solutions for handling such emergencies in a colocation data center, where the operator supplies power to multiple tenants. We propose a novel market mechanism based on supply function bidding, called COOP, that financially incentivizes and coordinates tenants' power reduction to minimize total performance loss while satisfying multiple power capping constraints. We demonstrate that COOP is "win-win", increasing the operator's profit (through oversubscription) and reducing tenants' costs (through financial compensation for their power reduction during emergencies).
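To make the idea of supply function bidding concrete, here is a minimal sketch of market clearing under one common textbook parameterization, a linear supply function in which tenant i offers a reduction of b_i * p at price p. This is an assumption for illustration and not necessarily the supply function form COOP uses; the function name `clear_market`, the tenant names, and the numbers are hypothetical, and the sketch handles a single capping constraint only.

```python
def clear_market(bids, required_reduction):
    """Clear a single-constraint market with linear supply functions (illustrative only).

    Assumes tenant i bids a nonnegative parameter b_i and supplies a power
    reduction of b_i * price. The operator picks the price at which the total
    supplied reduction equals `required_reduction`.
    """
    total_bid = sum(bids.values())
    if total_bid <= 0:
        raise ValueError("at least one tenant must submit a positive bid")
    price = required_reduction / total_bid
    # Each tenant's reduction follows from its own supply function.
    reductions = {tenant: b * price for tenant, b in bids.items()}
    # Tenants are compensated at the clearing price per unit of reduction.
    payments = {tenant: price * r for tenant, r in reductions.items()}
    return price, reductions, payments


# Hypothetical bids from two tenants and a hypothetical 20-unit shortfall.
price, reductions, payments = clear_market({"tenant_a": 1.0, "tenant_b": 3.0}, 20.0)
```

With these inputs the tenant with the larger bid parameter takes on the larger share of the reduction and receives the larger payment, and the reductions sum to the required amount.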

Item Type:

Thesis (Dissertation (Ph.D.))

Subject Keywords:

Cloud computing, Resource management

Degree Grantor:

California Institute of Technology

Division:

Engineering and Applied Science

Major Option:

Computer Science

Awards:

Bhansali Family Dissertation Prize in Computer Science, 2018. Demetriades-Tsafka-Kokkalis Prize in Environmentally Benign Renewable Energy Sources or Related Fields, 2018.

Thesis Availability:

Public (worldwide access)

Research Advisor(s):

Wierman, Adam C.

Group:

Resnick Sustainability Institute

Thesis Committee:

Wierman, Adam C. (chair)
Low, Steven H.
Chandy, K. Mani
Yue, Yisong

Defense Date:

15 May 2018

Funders:

Funding Agency	Grant Number
Resnick Sustainability Institute fellowship	UNSPECIFIED

Record Number:

CaltechTHESIS:05312018-080301508

Persistent URL:

https://resolver.caltech.edu/CaltechTHESIS:05312018-080301508

DOI:

10.7907/K62Y-FV39

Related URLs:

URL	URL Type	Description
https://dl.acm.org/citation.cfm?id=2616475	Publisher	Adapted into Chapter 2
https://doi.org/10.1145/2829988.2787481	DOI	Adapted into Chapter 3
https://doi.org/10.1109/TNET.2018.2811374	DOI	Adapted into Chapter 4
https://doi.org/10.1145/2825236.2825252	DOI	Adapted into Chapter 5
https://doi.org/10.1109/HPCA.2016.7446084	DOI	Adapted into Chapter 5

ORCID:

Author	ORCID
Ren, Xiaoqi	0000-0002-1121-9046

Default Usage Policy:

No commercial reproduction, distribution, display or performance rights in this work are provided.

ID Code:

10978

Collection:

CaltechTHESIS

Deposited By:

Xiaoqi Ren

Deposited On:

01 Jun 2018 19:29

Last Modified:

08 Nov 2023 18:43

Thesis Files


PDF - Final Version
See Usage Policy.
6MB
