“Dependable MapReduce in a Cloud-of-Clouds”

Pedro Costa (advised by Fernando Ramos, Miguel Correia)

Ph.D. dissertation, Doutoramento em Informática, Faculdade de Ciências da Universidade de Lisboa, Nov. 2017

Abstract: MapReduce is a simple and elegant programming model suitable for loosely coupled parallelization problems, i.e., problems that can be decomposed into subproblems. Hadoop MapReduce has become the most popular framework for performing large-scale computation on off-the-shelf clusters, and is widely used to solve such problems in a parallel and distributed fashion. This framework is highly scalable, can deal efficiently with large volumes of unstructured data, and serves as a platform for many other applications. However, the framework has limitations concerning dependability. Namely, it is designed only to tolerate crash faults, by re-executing tasks when they fail, and to detect file corruption using checksums. Unfortunately, there is evidence that arbitrary faults do occur and can affect the correctness of MapReduce execution. Although such Byzantine faults are considered to be rare, some MapReduce applications are critical and cannot tolerate this type of fault. Furthermore, typical MapReduce implementations are constrained to a single cloud environment. This is a problem, as there is increasing evidence of outages on major cloud offerings, raising concerns about the dependence on a single cloud. In this thesis, we propose techniques to improve the dependability of MapReduce systems. The proposed solutions allow MapReduce to scale out computations to a multi-cloud environment, or cloud-of-clouds, to tolerate arbitrary and malicious faults and cloud outages. Our proposals have three important properties: they increase the dependability of MapReduce by tolerating the faults mentioned above; they require minimal or no modifications to users’ applications; and they achieve this increased level of fault tolerance at reasonable cost. To achieve these goals, we introduce three key ideas: minimizing the required replication; applying context-based job scheduling based on cloud and network conditions; and performing fine-grained replication.
We evaluated all proposed solutions in real testbed environments running typical MapReduce applications. Our results demonstrate interesting trade-offs concerning resilience and performance when compared to traditional methods. The fundamental conclusion is that the cost introduced by our solutions is small, and thus deemed acceptable for many critical applications.