Bridging the theory of staged programming languages and the practice of high-performance computing

Icon

Shonan Meeting seminar019

Shonan Challenge

Challenge Request Form by Kenichi Asai

A brief list of challenges

Google Group

GitHub

Kenichi Asai. Challenge 1: (Simplified) Hidden Markov Model

External pointers

Tiark Rompf’s Delite mini-tutorial

Tempo, a partial evaluator for C

Workshop on essential abstractions in GCC, 2012

Schedule

May 18 (Friday)

  • 15:00 -: Hotel Check In (early check-in from 12:00 is negotiable if informed in advance)
  • 19:00 – 21:00: Welcome Reception

May 19 (Saturday)

May 20 (Sunday)

May 21 (Monday)

May 22 (Tuesday)

Discussion materials

BINQ: A Domain Specific Language for Genomic Computations

Ashish Agarwal, NYU, USA

Unabated advances in genomics technologies continue to increase the volume and variety of data that Biologists work with on a day-to-day basis. This results in two distinct software challenges: (i) Biologists require a platform that manages the deluge of data and methods they regularly employ, and (ii) this platform must parallelize code over multiple cores and nodes. I propose BINQ, a purely functional domain-specific-language (DSL) that allows expressing a wide range of computations commonly needed in genomics and systems biology. The type system aims to support the wide range of data and analyses required in a compact elegant language, and also to encode properties required for efficient implementation, such as whether a list is sorted. I will describe the challenges in implementing this language and explain where the parallelism opportunities lie. Finally, I will review our OCaml code base, which operates NYU’s Genomics Core Facility, and serves as motivation for the design of BINQ.

Combining Staged Programming and Empirical Optimization

Baris Aktemur, Ozyegin University, Turkey

A major motivation behind staged programming is to speed up programs by specializing them according to runtime inputs. Applications of staged programming we see in (theoretical) papers are mostly small examples that do not scale to large data used in HPC. To remedy this problem, staged programming must be performed with awareness of how to process large data efficiently. For instance, the generated program must utilize the cache effectively. However, given the complexity of modern computer architectures, this is not a trivial task. Empirical optimization (aka auto-tuning) techniques have been successfully used in HPC to adapt programs to target machines by selecting the best version as observed in install-time experiments. I will discuss our on-going work on combining empirical optimization with staged programming to select the best generator out of several candidates. This is a joint work with Prof. Sam Kamin and his research group.

A Calculational Framework for Parallel Programming with MapReduce

Zhenjiang Hu, NII, Japan

MapReduce, being inspired by the map and reduce primitives available in many functional languages, is the de facto standard for large scale data-intensive parallel programming. Although it has succeeded in popularizing the use of the two primitives for hiding the details of parallel computation, little effort has been made to emphasize the programming methodology behind. In this talk, I’d show that MapReduce can be equipped with a programming theory in calculational form. By integrating the generate-and-test programing paradigm and semirings for aggregation of results, we propose a novel parallel programming framework for MapReduce. The framework consists of two important calculation theorems: the shortcut fusion theorem of semiring homomorphisms bridges the gap between specifications and efficient implementations, and the filter-embedding theorem helps to develop parallel programs in a systematic and incremental way.
This is a joint work with Kento Emoto and Sebastian Fischer.

Towards application of supercompilation and metacomputation to high performance computing

Andrei V. Klimov, Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Russia

I work in a team which has experience in two areas: 1) construction of program specializers based on the methods of supercompilation and partial evaluation (which we collectively refer to as metacomputation) and 2) construction of supercomputers and development of software for them. Based on this work, we see large potential impact of metacomputation on HPC, most notable being as follows.

  • Our experience shows that powerful program specializers are capable of converting well-structured human-oriented, but badly parallelizable programs into “flat” code consisting of nested loops, for which existing parallelization methods apply well.
  • It is common opinion that development of new models of computation and renewing of some old ones (like dataflow) is required, which poses the problem of transforming legacy code to new forms (e.g., Fortran code to dataflow, data parallel, SIMD, pipelined code). There is no silver bullet, but such deep program analysis and transformation methods are actually emerging.
  • Another widespread viewpoint is that program verification becomes much more important in the parallel era than before. Between two approaches to program verification: based on logic methods and based on deep program transformation like supercompilation, we chose the later as more promising.

Embedded DSL for Stocastic Processes

Ken Friis Larsen, University of Copenhagen, Denmark

We present a language for specifying stochastic processes, called SPL. We show that SPL can express the price of a range of financial contracts, including so called exotic options with path dependence and with multiple sources of uncertainty. Jones, Eber and Seward previously presented a language for writing down financial contracts in a compositional manner, and specified a pricer for these contracts in terms of an abstract financial model and abstract stochastic processes. For the subset of prices that do not require nested forecasting, these can be specified in SPL, and we show an example of how to do this. The ease of writing a model that matches reality and the speed of computing the expected price is then highly dependent on the properties of SPL.
SPL is designed with the goal of matching the notation used in mathematical finance, which allows high-level specification of stochastic processes. The language is deeply embedded in Haskell, and we have given the language formal semantics in terms of the probability monad, as well as a type system in terms of Haskell’s type system. We provide an implementation of SPL that performs Monte Carlo simulation using GP-GPU, and we present data indicating that this implementation scales linearly with the number of available cores.

MetaHaskell: Type Safe Heterogeneous Metaprogramming

Geoffrey Mainland, Microsoft Research Cambridge , UK

Languages with support for metaprogramming, like MetaOCaml, offer a principled approach to code generation by guaranteeing that well-typed metaprograms produce well-typed programs. However, many problem domains where metaprogramming can fruitfully be applied require generating code in languages like C, CUDA, or assembly. Rather than resorting to add-hoc code generation techniques, these applications should be directly supported by explicitly heterogeneous metaprogramming languages.
In this talk I will present MetaHaskell, an extension of Haskell 98 that provides modular syntactic and type system support for type safe metaprogramming with multiple object languages. Adding a new object language to MetaHaskell requires only minor modifications to the host language to support type-level quantification over object language types and propagation of type equality constraints. We demonstrate the flexibility of our approach through three object languages: a core ML language, a linear variant of the core ML language, and a subset of C. All three languages support metaprogramming with open terms and guarantee that well-typed MetaHaskell programs will only produce closed object terms that are well-typed.

Paraiso: A code generation and automated tuning framework for explicit solvers of partial differential equations

Takayuki Muranushi, Kyoto University, Japan

As a simulation astrophysicist, my colleagues and I wish for something that (1) generates optimized codes on massively parallel computers such as the K computer, GPU-based supercomputers etc. (2) can describe practical (= complicated) algorithms and (3) is easy to program. None were available. Advances in programming languages made it possible even for physicists to design their own DSL. My hope is collaboration. Parallel and abstract languages act as interfaces, letting one group decompose their problems to such interface languages and other groups work on translating to real machines, and both groups work together in developing the interface.
As an example of such abstract interface I describe the Orthotope Machine and Paraiso, a DSL framework for generating parallel machine stencil computation, embedded in Haskell. Paraiso lets one write complicated hyperbolic partial differential equations (PDE) algorithms in succinct ways, by using mathematical language like type level tensors and algebraic type classes. Higher-order operations are also possible. Paraiso turns PDE algorithms into a program on the Orthotope Machine, a parallel virtual machine with very limited instruction set. These programs are then translated into multicore or GPU implementations. Paraiso automatically searches for faster implementation by benchmarking and evolutionary computation.
http://paraiso-lang.org/wiki/

Staging and Linguistic Reuse: Maintaining Relative Evaluation Order Across Stage Boundaries

Tiark Rompf, EPFL, Switzerland

Staging enables safer and more productive development of program generators through linguistic reuse: expressing the code generator and pieces of generated code in the same language allows the generated code to inherit static guarantees such as syntax correctness, scope correctness and type correctness from the language of the code generator. However, staging approaches in the MetaML tradition do not provide linguistic reuse of evaluation order. This is a big open problem in practice. Adding staging annotations to a program can dramatically change the result and the runtime behavior of the program. The underlying cause of this problem is that the staging mechanism, quasiquotation, treats program fragments in a syntactic, context-free way and not as context-dependent statements. In this talk, we will discuss the problem of maintaining relative execution order within a stage and describe a possible solution that employs a context-sensitive form of quasiquotation. We will also briefly discuss how Lightweight Modular Staging (LMS) achieves safety guarantees through linguistic reuse without quasiquotation.

High Performance Embedded DSLs with Delite

Tiark Rompf, EPFL, Switzerland

Delite is a framework for building high performance DSLs embedded in Scala. This talk will describe what it takes to build a Delite DSL and describe several real DSLs already being developed with Delite (including machine learning, graph analysis, and scientific computing). Delite handles parallel optimization and code generation, allowing DSL authors to focus on identifying the right abstractions and implementing domain-specific optimizations. We show that DSLs implemented with Delite achieve good performance compared to standard alternatives. Finally, we look ahead to what’s coming next with high performance DSLs and Delite.

Bioinformatics, statistical models and large-scale computations

Alexander Schliep, Rutgers, USA

In recent years molecular biology has seen a vast increase both in the number and size of datasets (i.e. genome sequences) and in the type of experiments performed. Consequently tools for fundamental tasks such as classification, clustering or segmentation of sequences have to be scaled to genome size data and adapted to the modified experimental setting. Using Hidden Markov Models (HMM) as an example, I will point out how this adaptation process leads to reimplementations of core HMM codes.

Reliable Generation of High-Performance Matrix Algebra

Jeremy Siek, University of Colorado at Boulder, USA

Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in suboptimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter the performance of hand-optimized code. In this talk I present a domain-specific compiler for matrix algebra, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.

ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming

Matthew Sottile, Galois Inc, USA

This talk describes a small experimental extension to Fortran 2008 that was implemented via source-to-source transformations to hide the low-level OpenCL implementation of stencil-based algorithms written in Fortran. The basis of this work is the use of three notable features that were introduced in the 1990 Fortran standard: pure functions; elemental functions; and data parallel array syntax. This talk gives a brief overview of the concept and prototype that we developed based on the ROSE compiler infrastructure. The basic building block in ForOpenCL is the concept of a halo that defines for a computation on an element of an array the extent of the array around it that is required to be accessible. This pattern arises frequently in stencil-based programs. We are currently working on a deeper extension to Fortran itself, that we call “Locally Oriented Programming”, where type annotations are used to express information related to halos. This halo information can be used by the compiler to determine efficient memory allocation and transfer patterns in hybrid systems and message passing operations in MPI-based programs.

Programming language features required in high performance computing, parallel computing, and automatic tuning

Reiji Suda, University of Tokyo,Japan

The author is working on high performance computing, parallel computing,numerical algorithms, and automatic tuning (autotuning). Recently the requirements on such research are increasing because of several factors, for example, now every processor has multiple cores. One of the difficulties in high performance computing is that the optimal choice depends on HW, SW, data and environmental conditions. To attain high performance on many conditions, software must have adaptability in it, called autotuning. For autotuning, we need to generate several variants of programs (and to choose an appropriate variant for each condition). Such variants can be generated by code transformation or code generation. Needs of explicit language support are high in variations of data structure, algorithms and parallel processing. Language support for debugging and testing of such software is also expected.

Staging 15 years later

Walid Taha, Halmstad University,Sweden

Multi-stage Programming with Explicit annotations, or staging for short, is an approach to giving the programmer control over evaluation order. Staging was introduced in 1997 in a paper by Tim Sheard, my PhD advisor, and myself. Significant activity by numerous researches followed. Many view staging as a way of improving program performance. While technically very true, there is an important reason why this can be a misleading characterization. To address this problem, this talk briefly reviews staging, describes some important lessons learned about staging, and suggests some promising directions for future work.

Performance Tuning for High Performance Computing Applications

Daisuke Takahashi, University of Tsukuba, Japan

In this talk, basic performance tuning techniques for high performance computing (HPC) applications will be presented. Many HPC applications work well when the data sets fit into the cache. However, when the problem size exceeds the cache size, the performance of these applications decreases dramatically. The key issue in the implementation of HPC applications is minimizing the number of cache misses. Vectorization and parallelization are also important to achieve high performance on multi-core processors that have short vector SIMD instructions. Some performance results will be also presented.

Travel Information

Getting to Shonan Village

The following instructions cover the common cases for a non-resident of Japan to move from Tokyo and its airports to Shonan Village. In general, first take trains to either JR’s Zushi station or Keikyu’s Shin-Zushi station, then take a bus or taxi. The Shonan Village Web site gives more information, but please do not hesitate to ask the organizers for help!

If you get lost

Railway staff (especially at ticketing offices) tend to speak some English and know about destinations. Bus drivers, maybe less likely. Ken’s mobile phone number is 080-4717-2921 in Japan.

Step 1 of 2: tracks

You can search for train schedules online. However, if using the “Suica & N’EX” package, please focus on trains labeled “Narita Express” or “JR” and disregard the fares calculated.

From Tokyo’s Narita airport

Take JR trains to Zushi. JR is Japan’s “national” railway company.
At the airport after exiting customs, go downstairs and find a JR office (not a Keisei office). Buy the “Suica & N’EX” package (it saves you more than 2000 yen). Take “N’EX” (Narita Express) to Yokohama or Ofuna, then take the Shonan-Shinjuku or Yokosuka line to Zushi. Don’t use the Suica card yet – the N’EX ticket will cover the JR train trip all the way to Zushi (about 2 hours) and, if you buy the round-trip N’EX ticket, back. Tell the JR representative that you’re going to Zushi and you know you have to change trains at Yokohama or Ofuna. The Narita Express train splits en route so board the car of your assigned seat.

From Tokyo’s Haneda airport

Take Keikyu trains (not the monorail) to Shin-Zushi. Keikyu is one of Japan’s many private railway companies.
At the airport, buy a PASMO or Suica card with about 2000 yen. (You might get a slightly better deal if you buy a PASMO card from a Keikyu ticket office and tell them you are going to Shin-Zushi.) Use the card to go to Shin-Zushi terminal (about 1 hour), possibly changing trains at Keikyu-Kamata and/or Kanazawa-Hakkei.

From central Tokyo

Take JR’s Yokosuka line (e.g., from Tokyo or Shinagawa station) or Shonan-Shinjuku line (e.g., from Ikebukuro, Shinjuku, or Shibuya) to Zushi (about 1 hour). You can also take Keikyu trains (e.g., from Shinagawa) to Shin-Zushi.

Step 2 of 2: roads

A Keikyu bus departs once or twice per hour from Zushi and Shin-Zushi stations (bay 1) to Shonan Village (the last stop). It takes 30 minutes and costs 340 yen. The bus time table is online. The departure times from Shin-Zushi are 2 minutes after the departure times from Zushi. The underlined departures are bus route 26; all other departures are bus route 16.

Pay for the bus with your Suica or PASMO card by touching it to the blue “IC” panel when you get on (through the center door) and again when you get off (through the front door). It is when you get off that you pay, so don’t ask the driver for a ticket when you get on. After you get off, your destination building is right across the road (north, in other words).

A taxi from Zushi or Shin-Zushi station to Shonan Village takes 20 minutes and costs 2500 or 3000 yen.

Travel tips

Arranging for a ride together

Please add a comment to this post to announce your arrival information and arrange to share transportation. The comments will be deleted after the seminar.

Earthquakes

In eastern Japan, small earthquakes this year are more frequent than usual. If you have never experienced even a small earthquake, you may wish to check the page “What should I do when…?” just in case (just as you should know how to evacuate in case of a fire even if the chance is very small).

Overview

This discussion-heavy workshop aims to bridge the theory of programming languages (PL) with the practice of high-performance computing (HPC). The topic of the discussion will be code generation, or staging, a form of meta-programming. Both the PL and HPC communities have come to realize the importance of code generation: whereas PL theorists widely regard staging as the leading approach to making modular software and expressive languages run fast, HPC practitioners widely regard staging as the leading approach to making high-performance software reusable and maintainable. Thanks to this confluence of theory and practice, we have the rare chance to bring together PL researchers and the potential consumers of their work.

A particular area of current interest shared by PL and HPC researchers is how to use domain-specific languages (DSL) to capture and automate patterns and techniques of code generation, transformation, and optimization that recur in an application domain. For example, HPC has created and benefited from expressive DSLs such as OpenMP directives, SPIRAL’s signal processing language, and specialized languages for stencil computations and domain decomposition. Moreover, staging helps to build efficient and expressive DSLs because it assures that the generated code is correct in the form of precise static guarantees.

Alas, the communication between PL researchers working on staging and HPC practitioners could be better. On one hand, HPC practitioners often do not know what PL research offers. For example, current staging technology not only makes sure that the generated code compiles without problems, but further eliminates frequent problems such as matrix dimension mismatch. Staged programming can detect and prevent these problems early during development, thus relieving users from scouring reams of obscure generated code to debug compiler errors, or waiting for expensive trial computations to produce dubious results. On the other hand, PL researchers often do not know how much HPC practitioners who write code generators value this or that theoretical advance or pragmatic benefit—in other words, how the HPC wish list is ranked by importance.

This workshop aims to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming. Specifically, we would like to determine:

  • how viable assured (MetaOCaml-like) meta-programming is for real-world applications;
  • how strong the demand is for static assurances on the generated code: well-formedness, well-typedness, numeric stability, absence of buffer overflows or dimension mismatch errors, etc.;
  • how important portability is, whether to move to a different target language or a different hardware platform;
  • which difficulties are “just” engineering (e.g., maintaining a viable, mature meta-programming system), which difficulties are of finding a good syntax, and which difficulties are foundational (e.g., code generation with binders and effects).

In short, we ask how program generation can or should help HPC.

The workshop participants consist of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). The workshop would benefit PL and staging theorists by informing them what HPC practice actually needs. HPC practitioners may also benefit, for example by learning new ways to understand or organize the code generators they are already writing and using.

To promote this mutual understanding, the workshop will have lots of time for discussion, emphasizing presentations that elicit questions rather than merely provide answers.

Participants

  1. Ashish Agarwal
  2. Baris Aktemur
  3. Kenichi Asai
  4. Shigeru Chiba
  5. Albert Cohen
  6. Zhenjiang Hu
  7. Atsushi Igarashi
  8. Yukiyoshi Kameyama (organizer)
  9. Oleg Kiselyov (organizer)
  10. Andrei Klimov
  11. Ken Friis Larsen
  12. Frédéric Loulergue
  13. Geoffrey Mainland
  14. Takayuki Muranushi
  15. Tiark Rompf
  16. Alexander Schliep
  17. Chung-chieh Shan (organizer)
  18. Jeremy Siek
  19. Matthew Sottile
  20. Reiji Suda
  21. Eijiro Sumii
  22. Walid Taha
  23. Daisuke Takahashi
  24. Peter Thiemann