## Algorithm Design for MapReduce and Beyond: Tutorial

**Where:** CIKM 2015

**Presenters:**

## Overview

MapReduce and Hadoop have been key drivers behind the "Big Data" movement of the past decade.
These systems impose a specific parallel paradigm on the algorithm designer, but in return they make
parallel programming simple, obviating the need to think about concurrency, fault tolerance, and cluster
management.

The MapReduce style of parallel processing has made certain operations nearly trivial to parallelize;
Word Count is the canonical "Hello World" example. Still, parallelizing many problems, e.g.,
computing a good clustering or counting the number of triangles in a graph, requires effort, since
straightforward approaches yield almost no speedup over a single-machine implementation.
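To fix ideas, the Word Count example can be sketched as a plain-Python simulation of the map, shuffle, and reduce phases (in a real system the framework performs the shuffle and distributes the work across machines; the function names here are illustrative, not part of any MapReduce API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] == 3
```

Word Count parallelizes trivially because both map and reduce act independently per record and per key; the harder problems discussed below lack this property.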

This tutorial will cover recent results on algorithm design for MapReduce and other modern parallel
architectures. We begin with an overview of the framework, and highlight the challenge of avoiding
communication and computational bottlenecks. We then introduce a toolkit of algorithmic strategies
for dealing with large datasets using MapReduce. The goal of most of these approaches is to reduce
the data size (from petabytes and terabytes to gigabytes and megabytes), while preserving its structure
relevant to the problem of interest. Sketching, composable coresets, and adaptive sampling all fall into
this category of approaches.
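As a minimal illustration of the composable-coreset idea, consider k-center clustering: each machine summarizes its partition with a small set of well-spread points (here via the classic Gonzalez farthest-point heuristic), the summaries are merged, and the problem is solved on the small union. This is a sketch under simplifying assumptions (Euclidean points, each partition holding at least k points), not a production implementation:

```python
import math

def gonzalez(points, k):
    """Greedy farthest-point traversal: repeatedly add the point farthest
    from the current centers. The k chosen points are a 2-approximate
    k-center solution and serve as a coreset for their partition."""
    centers = [points[0]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers

def composable_kcenter(partitions, k):
    # Map: each machine reduces its partition to a size-k coreset.
    coresets = [gonzalez(part, k) for part in partitions]
    # Reduce: merge the small coresets and solve on the union.
    merged = [p for cs in coresets for p in cs]
    return gonzalez(merged, k)
```

The key property is composability: the union of per-partition summaries is itself a good summary of the whole dataset, so the expensive step runs only on data small enough for one machine.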

We then turn to specific applications, both to showcase these techniques and to highlight recently
developed practical methods. Our initial focus is on clustering, whose many variants form the core of data
analysis. We cover classic clustering methods, such as k-means, as well as more modern approaches
like correlation clustering and hierarchical clustering. We then turn to methods for graph analysis,
building up our intuition with algorithms for graph connectivity and moving on to graph decompositions,
matchings, spanning trees, and subgraph counting.
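To see how clustering maps onto this paradigm, one Lloyd iteration of k-means can be phrased as a map step (assign each point to its nearest center) followed by a reduce step (average the points assigned to each center). This is a hedged single-machine sketch of the standard formulation; the helper names and the fixed dimension argument are assumptions for illustration:

```python
import math

def assign_step(points, centers):
    """Map: emit (index of nearest center, point) for each point."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p

def update_step(assignments, dim):
    """Reduce: the new center for each index is the mean of its points."""
    sums, counts = {}, {}
    for idx, p in assignments:
        s = sums.setdefault(idx, [0.0] * dim)
        for d in range(dim):
            s[d] += p[d]
        counts[idx] = counts.get(idx, 0) + 1
    return {i: tuple(v / counts[i] for v in s) for i, s in sums.items()}

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
new_centers = update_step(assign_step(points, [(0, 1), (10, 1)]), dim=2)
```

Because the reduce step only needs per-center sums and counts, the shuffled data is tiny compared to the input, which is exactly the communication pattern the tutorial's techniques aim for.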