Large-scale Distributed Computation

NII Shonan Meeting:

@ Shonan Village Center, Jan. 11-15, 2012

    Organizers:
  • Graham Cormode (AT&T Labs-Research)
  • S. Muthukrishnan (Rutgers University)
  • Ke Yi (Hong Kong University of Science and Technology)

Overview

As the amount of data produced by large-scale systems such as environmental monitoring, scientific experiments and communication networks grows rapidly, new approaches are needed to process and analyze such data effectively. There are many promising directions in the area of large-scale distributed computation, in which multiple computing entities work together over partitions of the (huge) data to perform complex computations. Two important paradigms in this realm are distributed continuous monitoring, which continually maintains an accurate estimate of a complex query, and MapReduce, a primarily batch approach to large cluster computation. The aim of this meeting is to bring together computer scientists with interests in this field to present recent innovations, find topics of common interest, and stimulate further development of new approaches to deal with massive data.

Description of the Meeting

If Computer Science in the 20th Century was defined by the inception and astounding growth
of computing power, in the 21st Century it will be defined by the huge growth in data. The ability
to capture vast quantities of data (from network traffic, scientific experiments, online activity) is
expanding rapidly, and at a rate far higher than the increase in our ability to process it. While
Moore's law has held for many decades, tracking an exponential growth in computational power,
data production has recently been growing at an even faster rate: see
recent estimates such as http://wikibon.org/blog/cloud-storage/.
Therefore, the new challenge for Computer Science in the 21st Century is how to deal with
such a data deluge. Several promising new directions have emerged in recent years:

Distributed Continuous Monitoring. In many settings, new data is observed locally: at a
router in a network, or at a sensor in a larger sensor network. The volume of observations is too large
to move to a central location and process together; instead, it is necessary to perform distributed
computation over the data. Since the new observations are continually arriving, we must produce
a continual answer to complex monitoring queries, all while ensuring that the communication
cost necessary to maintain the result, and the computational cost of the tracking, are minimized
to meet the data throughput demands. In recent years, there have been several advances in this
field:

  • The geometric monitoring approach, which views the values of the function being monitored
    as points in a metric space, and covers the region of interest with monochromatic balls.
    This enables arbitrarily complex functions to be decomposed into local conditions that can
    be monitored efficiently [16, 17, 15].
  • Use of sketches and randomized summaries to compactly summarize large complex distributions,
    allowing approximation of fundamental quantities such as order statistics, inner
    products, distinct items and frequency moments [7, 8].
  • New ways to incorporate time decay and sliding windows, to allow queries to be based
    only on the more recent observations, and to reduce or remove the contribution of outdated
    observations, while still communicating far less than the full set of observations [4, 9].
  • Extension of monitoring techniques to more complex, non-linear functions such as the (empirical)
    entropy [16, 2].
  • Advances in sampling technology to enable drawing uniform samples over multiple streams
    arriving at different rates, both with and without replacement [3, 9].
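
To make the sketching idea above concrete, the following is a minimal single-machine sketch of the k-minimum-values (KMV) technique for distinct counting (the class name, hash choice and parameter k = 256 are illustrative, not taken from the cited works): each site retains only the k smallest hash values it has seen, sketches from different sites merge by union, and a coordinator estimates the number of distinct items across all sites.

```python
import hashlib

def hash01(x):
    # Map an item to a pseudo-random point in [0, 1).
    d = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class KMVSketch:
    """k-minimum-values sketch: retain the k smallest hash values seen."""
    def __init__(self, k=256):
        self.k = k
        self.mins = set()

    def add(self, item):
        self.mins.add(hash01(item))
        if len(self.mins) > self.k:
            self.mins.discard(max(self.mins))

    def merge(self, other):
        # Sketches built at different sites combine by taking the union
        # of their hash sets and again keeping only the k smallest.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        # With fewer than k hashes the count is exact (up to collisions);
        # otherwise the k-th smallest hash value yields the estimate.
        if len(self.mins) < self.k:
            return len(self.mins)
        return (self.k - 1) / max(self.mins)

# Two sites observe overlapping streams; a coordinator merges the sketches.
site_a, site_b = KMVSketch(), KMVSketch()
for i in range(5000):
    site_a.add(i)
for i in range(2500, 7500):
    site_b.add(i)
est = site_a.merge(site_b).estimate()  # true distinct count is 7500
```

The key property is that the merged sketch is identical to the sketch that would have been built from the union of the two streams, so each site only ever ships k hash values to the coordinator, regardless of stream length.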

However, there remain many challenging questions to address in this area:

  • A more general theory of distributed continuous computation: what functions and computations
    can be performed in this model? Are there hardness results and separation theorems
    that can be proved? Can a stronger set of lower bounds be provided?
  • A stronger sense of the connection to other areas in computer science, such as communication
    complexity, network coding, coding theory, compressed sensing, and so on. Can
    techniques from these areas be extended to the distributed continuous setting?
  • Systems issues have so far received only minor consideration. Can general purpose systems
    be designed, in a comparable way to centralized databases, which can accept complex
    queries in a high-level language, and then produce and deploy query monitoring plans?

Cluster Computing. As data sizes increase while the power of individual computing cores begins
to saturate, there has been a move to adopt cluster computing: harnessing multiple computing
nodes to work together to solve huge data processing tasks in parallel. The best-known example of
this is the MapReduce paradigm, and its open source Hadoop implementation. Computationally
speaking, the approach is for each compute node to process data that is stored locally to it in a
distributed file system, and to Map this data by deriving a new data set indexed by a key value. The
system then collects all tuples with the same key together at a second node, which proceeds to
Reduce these tuples to obtain the (partial) output. This paradigm has proved highly successful
for many large scale computations, such as search engine log-analysis, building large machine
learning models and data analysis such as clustering. Other key technical advances include:

  • A foundation for models of this style of computation, including the Massive Unordered
    Distributed (MUD) model and the Karloff-Suri-Vassilvitskii (KSV) model. These help to
    understand what computations can be performed effectively on clusters [11, 13].
  • Development of Online versions of the protocols, which use pipelining of data between
    operators, to move beyond the batch view of cluster computing, and to quickly provide
    partial results to complex queries which are refined as computation progresses [6].
  • Initial attempts to understand what class of algorithms can be effectively performed in these
    models, such as set cover, clustering, database, geometric and graph computations [5, 14, 10,
    1, 12].
  • Overlays, such as Facebook's Hive, which provide advanced data processing and an abstract
    query language on top of Hadoop to enable data warehousing applications [18].
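
The Map/shuffle/Reduce cycle described above can be sketched in a few lines of single-machine Python, using the canonical word-count example (the function names here are ours; in Hadoop the shuffle and the assignment of keys to nodes are handled by the framework):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: turn a locally stored chunk of data into (key, value) tuples.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: the framework groups all tuples with the same key so
    # that they arrive together at the node running the reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into the (partial) output.
    return key, sum(values)

docs = ["to be or not to be", "to do is to be"]  # two "file chunks"
mapped = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts == {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```

In a real cluster, each call to map_phase runs on the node that already holds the chunk, and the reduce_phase for different keys may run on different nodes, which is what makes the paradigm scale.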

The main challenges in this area include:

  • Understanding the extent to which approximation and randomization can further advance
    efficiency of computation.
  • Richer models which help us understand the tradeoff between the different cost parameters:
    amount of communication, number of compute nodes, memory available at each node,
    number of rounds of MapReduce, etc.
  • More examples of non-trivial computations that can be performed in MapReduce, leading
    to a more general theory of cluster computation and data processing under MapReduce.
  • More efficient implementations and variations of the model allowing real-time and more
    interactive use of clusters.

The aim of this workshop is to bring together researchers active in the areas of distributed data
processing, to address these fundamental issues. We hope to encourage greater cooperation and
interaction between currently separate communities, ultimately leading to new advances in these
important developing areas.
