MapReduce and Hadoop have been key drivers behind the “Big Data” movement of the past decade. These systems impose a specific parallel paradigm on the algorithm designer; in return, they make parallel programming simple, removing the need to reason about concurrency, fault tolerance, and cluster management.
The MapReduce style of parallel processing has made certain operations nearly trivial to parallelize: Word Count is the canonical “Hello World” example. Still, parallelizing many problems, e.g., computing a good clustering or counting the number of triangles in a graph, requires effort, since straightforward approaches yield almost no speedup over a single-machine implementation.
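To make the paradigm concrete, here is a minimal single-process sketch of the Word Count example: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. The function names are illustrative, not a real MapReduce API.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework's shuffle would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(values) for word, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts["the"] == 3, counts["fox"] == 2
```

Because both map and reduce operate on independent keys, the framework can run them on different machines with no shared state, which is exactly what makes this example trivially parallel.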
This tutorial will cover recent results on algorithm design for MapReduce and other modern parallel architectures. We begin with an overview of the framework and highlight the challenge of avoiding communication and computational bottlenecks. We then introduce a toolkit of algorithmic strategies for dealing with large datasets using MapReduce. The goal of most of these approaches is to reduce the data size (from petabytes and terabytes to gigabytes and megabytes) while preserving the structure relevant to the problem of interest. Sketching, composable coresets, and adaptive sampling all fall into this category of approaches.
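The key property shared by these summaries is composability: each machine summarizes its own partition, and the summaries can be merged into a summary of the whole dataset. As a minimal illustration (assuming this simplified setting, with illustrative function names rather than any real library), here is a k-minimum-values (KMV) sketch for estimating the number of distinct items:

```python
import hashlib

K = 64  # sketch size; relative error is roughly 1 / sqrt(K)

def _hash01(x):
    # Hash an item to a pseudo-uniform value in [0, 1).
    digest = hashlib.md5(str(x).encode()).hexdigest()
    return int(digest, 16) / 2**128

def sketch(items):
    # Keep the K smallest distinct hash values seen in one partition.
    return sorted({_hash01(x) for x in items})[:K]

def merge(s1, s2):
    # Composability: the sketch of a union is the K smallest values
    # drawn from the two partitions' sketches.
    return sorted(set(s1) | set(s2))[:K]

def estimate_distinct(s):
    # KMV estimator: about (K - 1) / (K-th smallest hash value).
    if len(s) < K:
        return len(s)  # fewer than K distinct items seen: count is exact
    return int((K - 1) / s[K - 1])

# Two "machines" sketch disjoint partitions; merging the small sketches
# yields an estimate for the full dataset without moving the raw data.
part1 = sketch(range(0, 5000))
part2 = sketch(range(5000, 10000))
est = estimate_distinct(merge(part1, part2))  # close to 10000
```

Each partition of any size is compressed to K numbers, and the merge is associative, so the same scheme works with any number of machines; this data-size reduction while preserving the quantity of interest is the pattern the tutorial's techniques follow.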
We then turn to specific applications, both to showcase these techniques and to highlight recently developed practical methods. Our initial focus is on clustering, whose many variants form the core of data analysis. We cover classic clustering methods, such as k-means, as well as more modern approaches like correlation clustering and hierarchical clustering. We then turn to methods for graph analysis, building up our intuition with algorithms for graph connectivity and moving on to graph decompositions, matchings, spanning trees, and subgraph counting.