Apr 25, 2012
BINQ: A Domain Specific Language for Genomic Computations
Unabated advances in genomics technologies continue to increase the volume and variety of data that biologists work with on a day-to-day basis. This results in two distinct software challenges: (i) biologists require a platform that manages the deluge of data and methods they regularly employ, and (ii) this platform must parallelize code over multiple cores and nodes. I propose BINQ, a purely functional domain-specific language (DSL) for expressing a wide range of computations commonly needed in genomics and systems biology. The type system aims to support the wide range of data and analyses required in a compact, elegant language, and also to encode properties required for efficient implementation, such as whether a list is sorted. I will describe the challenges in implementing this language and explain where the opportunities for parallelism lie. Finally, I will review our OCaml code base, which operates NYU's Genomics Core Facility and serves as the motivation for the design of BINQ.
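As a minimal sketch of the last idea (in Haskell rather than BINQ, with invented names): a property such as sortedness can be carried in the type by hiding the constructor behind a smart constructor, so that consumers which are only correct on sorted input can demand that evidence statically.

    import Data.List (sort)

    -- The constructor would be hidden behind a module boundary, so the
    -- only way to obtain a Sorted value is via fromList (or merge).
    newtype Sorted a = Sorted { getSorted :: [a] }

    fromList :: Ord a => [a] -> Sorted a
    fromList = Sorted . sort

    -- A linear-time merge is only correct on sorted input; its type now
    -- requires exactly that.
    merge :: Ord a => Sorted a -> Sorted a -> Sorted a
    merge (Sorted xs) (Sorted ys) = Sorted (go xs ys)
      where
        go [] bs = bs
        go as [] = as
        go (a:as) (b:bs)
          | a <= b    = a : go as (b:bs)
          | otherwise = b : go (a:as) bs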
Combining Staged Programming and Empirical Optimization
A major motivation behind staged programming is to speed up programs by specializing them according to runtime inputs. The applications of staged programming we see in (theoretical) papers are mostly small examples that do not scale to the large data sets used in HPC. To remedy this problem, staged programming must be performed with an awareness of how to process large data efficiently; for instance, the generated program must utilize the cache effectively. Given the complexity of modern computer architectures, this is not a trivial task. Empirical optimization (aka auto-tuning) techniques have been successfully used in HPC to adapt programs to target machines by selecting the best version as observed in install-time experiments. I will discuss our ongoing work on combining empirical optimization with staged programming to select the best generator out of several candidates. This is joint work with Prof. Sam Kamin and his research group.
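The selection step itself is conceptually simple. A minimal Haskell sketch of install-time "pick the fastest variant" (the two candidate kernels stand in for the outputs of different staged generators; all names are illustrative):

    import Data.List (foldl', minimumBy)
    import Data.Ord (comparing)
    import System.CPUTime (getCPUTime)

    -- Two candidate variants standing in for generated code.
    sumLazy, sumStrict :: [Double] -> Double
    sumLazy   = foldr (+) 0
    sumStrict = foldl' (+) 0

    timeIt :: ([Double] -> Double) -> [Double] -> IO Double
    timeIt f xs = do
      t0 <- getCPUTime
      t1 <- f xs `seq` getCPUTime   -- force the result before reading the clock
      return (fromIntegral (t1 - t0) * 1e-12)  -- picoseconds -> seconds

    main :: IO ()
    main = do
      let input    = [1 .. 1000000] :: [Double]
          variants = [("foldr", sumLazy), ("foldl'", sumStrict)]
      timings <- mapM (\(name, f) -> fmap ((,) name) (timeIt f input)) variants
      putStrLn ("selected variant: " ++ fst (minimumBy (comparing snd) timings))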
A Calculational Framework for Parallel Programming with MapReduce
MapReduce, inspired by the map and reduce primitives available in many functional languages, is the de facto standard for large-scale data-intensive parallel programming. Although it has succeeded in popularizing the use of the two primitives for hiding the details of parallel computation, little effort has been made to emphasize the programming methodology behind them. In this talk, I will show that MapReduce can be equipped with a programming theory in calculational form. By integrating the generate-and-test programming paradigm with semirings for the aggregation of results, we propose a novel parallel programming framework for MapReduce. The framework consists of two important calculation theorems: the shortcut fusion theorem of semiring homomorphisms bridges the gap between specifications and efficient implementations, and the filter-embedding theorem helps to develop parallel programs in a systematic and incremental way.
This is joint work with Kento Emoto and Sebastian Fischer.
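A toy instance of the generate-test-aggregate style (my example, not one from the talk): counting the subsequences of a list whose elements are all even. The specification generates all candidates and filters; the counting semiring (Integer with + and *) lets the whole pipeline fuse into one linear pass, which is the shape of the shortcut-fusion theorem above.

    import Data.List (subsequences)

    -- Specification (exponential): generate all 2^n subsequences,
    -- test each one, then aggregate by counting.
    specCount :: [Int] -> Integer
    specCount = fromIntegral . length . filter (all even) . subsequences

    -- Fused version (linear): each element contributes a factor in the
    -- counting semiring: 2 if it may be kept or dropped, 1 if it must
    -- be dropped. Same result, one pass, trivially parallelizable.
    fusedCount :: [Int] -> Integer
    fusedCount = product . map (\x -> if even x then 2 else 1)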
Towards application of supercompilation and metacomputation to high performance computing
I work in a team with experience in two areas: 1) the construction of program specializers based on the methods of supercompilation and partial evaluation (which we collectively refer to as metacomputation) and 2) the construction of supercomputers and the development of software for them. Based on this work, we see a large potential impact of metacomputation on HPC, the most notable opportunities being the following.
- Our experience shows that powerful program specializers are capable of converting well-structured, human-oriented, but poorly parallelizable programs into "flat" code consisting of nested loops, to which existing parallelization methods apply well.
- It is a common opinion that new models of computation must be developed, and some old ones (like dataflow) revived, which poses the problem of transforming legacy code into new forms (e.g., Fortran code into dataflow, data-parallel, SIMD, or pipelined code). There is no silver bullet, but such deep program analysis and transformation methods are in fact emerging.
- Another widespread viewpoint is that program verification becomes much more important in the parallel era than before. Of the two approaches to program verification, one based on logic methods and the other on deep program transformations such as supercompilation, we chose the latter as the more promising.
Embedded DSL for Stochastic Processes
We present a language for specifying stochastic processes, called SPL. We show that SPL can express the price of a range of financial contracts, including so-called exotic options with path dependence and multiple sources of uncertainty. Peyton Jones, Eber and Seward previously presented a language for writing down financial contracts in a compositional manner, and specified a pricer for these contracts in terms of an abstract financial model and abstract stochastic processes. The subset of these prices that do not require nested forecasting can be specified in SPL, and we show an example of how to do this. The ease of writing a model that matches reality and the speed of computing the expected price are then highly dependent on the properties of SPL.
SPL is designed with the goal of matching the notation used in mathematical finance, which allows high-level specification of stochastic processes. The language is deeply embedded in Haskell, and we have given it a formal semantics in terms of the probability monad, as well as a type system in terms of Haskell's type system. We provide an implementation of SPL that performs Monte Carlo simulation on GPGPUs, and we present data indicating that this implementation scales linearly with the number of available cores.
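For readers unfamiliar with the denotational device mentioned above, the standard textbook discrete probability monad (a sketch of the general construction, not SPL's actual implementation) looks like this, here pricing a two-step binomial call option with illustrative numbers:

    -- A value of type Dist a is a finite distribution: outcomes with weights.
    newtype Dist a = Dist { runDist :: [(a, Double)] }

    instance Functor Dist where
      fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

    instance Applicative Dist where
      pure x = Dist [(x, 1)]
      Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

    instance Monad Dist where
      Dist xs >>= k = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x) ]

    expectation :: Dist Double -> Double
    expectation (Dist xs) = sum [ x * p | (x, p) <- xs ]

    -- One binomial move of the stock price, up 10% or down 10%.
    move :: Double -> Dist Double
    move s = Dist [(1.1 * s, 0.5), (0.9 * s, 0.5)]

    -- Expected payoff of a strike-100 call after two moves from 100.
    callPrice :: Double
    callPrice = expectation (fmap payoff (move 100 >>= move))
      where payoff s = max (s - 100) 0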
MetaHaskell: Type Safe Heterogeneous Metaprogramming
Languages with support for metaprogramming, like MetaOCaml, offer a principled approach to code generation by guaranteeing that well-typed metaprograms produce well-typed programs. However, many problem domains where metaprogramming can fruitfully be applied require generating code in languages like C, CUDA, or assembly. Rather than resorting to ad hoc code generation techniques, these applications should be directly supported by explicitly heterogeneous metaprogramming languages.
In this talk I will present MetaHaskell, an extension of Haskell 98 that provides modular syntactic and type system support for type safe metaprogramming with multiple object languages. Adding a new object language to MetaHaskell requires only minor modifications to the host language to support type-level quantification over object language types and propagation of type equality constraints. We demonstrate the flexibility of our approach through three object languages: a core ML language, a linear variant of the core ML language, and a subset of C. All three languages support metaprogramming with open terms and guarantee that well-typed MetaHaskell programs will only produce closed object terms that are well-typed.
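MetaHaskell's mechanism is its own, but the guarantee it provides can be illustrated with the familiar GADT encoding of a typed object language in plain Haskell (a sketch, not MetaHaskell syntax): because object terms are indexed by their object-language types, ill-typed object code is simply unrepresentable, so any generator that type-checks is safe.

    {-# LANGUAGE GADTs #-}

    -- Object terms indexed by their object-language type.
    data Exp t where
      IntE :: Int -> Exp Int
      AddE :: Exp Int -> Exp Int -> Exp Int
      MulE :: Exp Int -> Exp Int -> Exp Int
      LeE  :: Exp Int -> Exp Int -> Exp Bool
      IfE  :: Exp Bool -> Exp t -> Exp t -> Exp t

    -- A metaprogram: unrolled exponentiation. Every Exp Int it returns
    -- is well-typed object code by construction.
    power :: Int -> Exp Int -> Exp Int
    power 0 _ = IntE 1
    power n x = MulE x (power (n - 1) x)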
Paraiso: A code generation and automated tuning framework for explicit solvers of partial differential equations
As a simulation astrophysicist, my colleagues and I wish for something that (1) generates optimized code for massively parallel computers such as the K computer and GPU-based supercomputers, (2) can describe practical (i.e., complicated) algorithms, and (3) is easy to program. Nothing available met all three. Advances in programming languages have made it possible even for physicists to design their own DSLs. My hope is collaboration: parallel, abstract languages can act as interfaces, letting one group decompose their problems into such interface languages while other groups work on translating them to real machines, with both groups cooperating on the development of the interface itself.
As an example of such an abstract interface, I describe the Orthotope Machine and Paraiso, a DSL framework embedded in Haskell for generating stencil computations for parallel machines. Paraiso lets one write complicated algorithms for hyperbolic partial differential equations (PDEs) in succinct ways, using mathematical language such as type-level tensors and algebraic type classes. Higher-order operations are also possible. Paraiso turns PDE algorithms into programs for the Orthotope Machine, a parallel virtual machine with a very limited instruction set. These programs are then translated into multicore or GPU implementations. Paraiso automatically searches for faster implementations by benchmarking and evolutionary computation.
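For concreteness, the kernels in question are array-wide pointwise updates over a fixed neighborhood. A hand-written explicit Euler step for 1D diffusion (plain Haskell, not Paraiso's notation) is:

    -- One step of u_t = nu * u_xx on a periodic 1D grid.
    diffuse :: Double -> [Double] -> [Double]
    diffuse nu u = zipWith3 update (rotate (-1) u) u (rotate 1 u)
      where
        update l c r = c + nu * (l - 2 * c + r)
        rotate n xs  = let k = n `mod` length xs
                       in drop k xs ++ take k xs

Paraiso's point is that the programmer writes only the mathematical update, while traversal order, boundary handling, and the multicore or GPU mapping are generated, and then tuned, automatically.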
Staging and Linguistic Reuse: Maintaining Relative Evaluation Order Across Stage Boundaries
Staging enables safer and more productive development of program generators through linguistic reuse: expressing the code generator and pieces of generated code in the same language allows the generated code to inherit static guarantees such as syntax correctness, scope correctness and type correctness from the language of the code generator. However, staging approaches in the MetaML tradition do not provide linguistic reuse of evaluation order. This is a big open problem in practice. Adding staging annotations to a program can dramatically change the result and the runtime behavior of the program. The underlying cause of this problem is that the staging mechanism, quasiquotation, treats program fragments in a syntactic, context-free way and not as context-dependent statements. In this talk, we will discuss the problem of maintaining relative execution order within a stage and describe a possible solution that employs a context-sensitive form of quasiquotation. We will also briefly discuss how Lightweight Modular Staging (LMS) achieves safety guarantees through linguistic reuse without quasiquotation.
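A minimal illustration of the issue, with generated code modeled as strings purely for exposition: splicing the same quoted fragment into two holes duplicates the computation, and any effects, in the generated program, even though the unstaged program evaluated it once. A context-sensitive generator must insert a let to preserve the original evaluation order and count.

    type Code = String   -- toy representation of generated code

    -- Unstaged meaning: the expression is evaluated once, used twice.
    -- Naive, context-free splicing duplicates the fragment:
    naiveTwice :: Code -> Code
    naiveTwice e = "(" ++ e ++ ") + (" ++ e ++ ")"

    -- Context-sensitive generation restores sharing with a let-binding:
    sharedTwice :: Code -> Code
    sharedTwice e = "let y = " ++ e ++ " in y + y"

    -- naiveTwice "readSensor ()" performs the effect twice in the
    -- generated code; sharedTwice performs it once, as the source did.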
High Performance Embedded DSLs with Delite
Delite is a framework for building high performance DSLs embedded in Scala. This talk will describe what it takes to build a Delite DSL and describe several real DSLs already being developed with Delite (including machine learning, graph analysis, and scientific computing). Delite handles parallel optimization and code generation, allowing DSL authors to focus on identifying the right abstractions and implementing domain-specific optimizations. We show that DSLs implemented with Delite achieve good performance compared to standard alternatives. Finally, we look ahead to what’s coming next with high performance DSLs and Delite.
Bioinformatics, statistical models and large-scale computations
In recent years molecular biology has seen a vast increase both in the number and size of datasets (e.g., genome sequences) and in the variety of experiments performed. Consequently, tools for fundamental tasks such as classification, clustering, or segmentation of sequences have to be scaled to genome-size data and adapted to the modified experimental settings. Using Hidden Markov Models (HMMs) as an example, I will point out how this adaptation process leads to reimplementations of core HMM codes.
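One concrete reason such reimplementation is unavoidable: over genome-length observation sequences, the textbook forward recursion underflows, so production codes rescale each step (or work in log space). A minimal scaled forward pass in Haskell (the list-based model representation is an illustrative assumption):

    import Data.List (transpose)

    -- logLik pi0 trans emits = log P(o_1..o_T), where trans !! i !! j is
    -- the transition probability i -> j and emits !! t !! j is P(o_t | j).
    logLik :: [Double] -> [[Double]] -> [[Double]] -> Double
    logLik pi0 trans (e0:es) = go (map (/ c0) a0) (log c0) es
      where
        a0 = zipWith (*) pi0 e0
        c0 = sum a0
        go _     ll []       = ll
        go alpha ll (e:rest) =
          let a' = zipWith (*) e
                     [ sum (zipWith (*) alpha col) | col <- transpose trans ]
              c  = sum a'                 -- per-step scaling factor
          in go (map (/ c) a') (ll + log c) rest
    logLik _ _ [] = error "empty observation sequence"

The scaling factors c keep the working probabilities near 1, and their logs sum to exactly the log-likelihood the unscaled recursion would have produced.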
Reliable Generation of High-Performance Matrix Algebra
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in suboptimal performance: the entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter of the performance of hand-optimized code. In this talk I present a domain-specific compiler for matrix algebra, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.
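The fusion and contraction opportunities BTO searches over can be seen in miniature (illustrative Haskell, lists standing in for arrays): two successive BLAS-like operations traverse the data twice and materialize a temporary, while the fused form is a single pass with the temporary contracted away.

    -- Unfused: t = a*x + y (axpy), then t . y (dot).
    -- Two traversals; the temporary t is materialized.
    unfused :: Double -> [Double] -> [Double] -> Double
    unfused a x y = sum (zipWith (*) t y)
      where t = zipWith (\xi yi -> a * xi + yi) x y

    -- Fused: one traversal, no temporary (array contraction).
    fused :: Double -> [Double] -> [Double] -> Double
    fused a x y = sum (zipWith (\xi yi -> (a * xi + yi) * yi) x y)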
ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming
This talk describes a small experimental extension to Fortran 2008, implemented via source-to-source transformations, that hides the low-level OpenCL implementation of stencil-based algorithms written in Fortran. The basis of this work is the use of three notable features introduced in the 1990 Fortran standard: pure functions, elemental functions, and data-parallel array syntax. This talk gives a brief overview of the concept and of the prototype we developed on the ROSE compiler infrastructure. The basic building block in ForOpenCL is the concept of a halo, which defines, for a computation on an array element, the extent of the surrounding array that must be accessible. This pattern arises frequently in stencil-based programs. We are currently working on a deeper extension to Fortran itself, which we call "Locally Oriented Programming", where type annotations are used to express halo information. This information can be used by the compiler to determine efficient memory allocation and transfer patterns in hybrid systems, and message-passing operations in MPI-based programs.
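The halo concept, stripped to its essence (a Haskell sketch rather than Fortran/ForOpenCL): the kernel declares the neighborhood width it needs, so the caller, like the compiler in ForOpenCL, knows exactly how much surrounding data must be resident or transferred.

    -- Apply a kernel that needs `halo` elements on each side of every
    -- point. Boundaries are clamped here; ForOpenCL would instead use the
    -- halo width to size local memory and ghost-cell exchanges.
    stencil1D :: Int -> ([a] -> b) -> [a] -> [b]
    stencil1D halo kernel xs =
      [ kernel (take (2 * halo + 1) (drop i padded))
      | i <- [0 .. length xs - 1] ]
      where
        padded = replicate halo (head xs) ++ xs ++ replicate halo (last xs)

    -- Example: a three-point moving average, i.e. a halo of 1.
    smooth :: [Double] -> [Double]
    smooth = stencil1D 1 (\[l, c, r] -> (l + c + r) / 3)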
Programming language features required in high performance computing, parallel computing, and automatic tuning
The author works on high performance computing, parallel computing, numerical algorithms, and automatic tuning (autotuning). Recently, the demands on such research have been increasing for several reasons; for example, every processor now has multiple cores. One of the difficulties in high performance computing is that the optimal choice depends on the hardware, the software, the data, and environmental conditions. To attain high performance under many conditions, software must have built-in adaptability; this is what autotuning provides. For autotuning, we need to generate several variants of a program and to choose an appropriate variant for each condition. Such variants can be generated by code transformation or code generation. The need for explicit language support is greatest for expressing variations in data structures, algorithms, and parallel processing; language support for debugging and testing such software is also desirable.
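In its simplest form, the adaptability in question is variant dispatch on a runtime condition. A minimal Haskell sketch (the threshold and both variants are illustrative assumptions; a real autotuner would calibrate the threshold experimentally):

    import Data.List (foldl')

    -- Assumed cache capacity in Doubles (4 MiB / 8 bytes).
    cacheElems :: Int
    cacheElems = 4 * 1024 * 1024 `div` 8

    -- In-cache variant: a single straightforward pass.
    sumSimple :: [Double] -> Double
    sumSimple = foldl' (+) 0

    -- Out-of-cache variant: process in cache-sized chunks (standing in
    -- for a blocked or parallel implementation chosen for large inputs).
    sumChunked :: [Double] -> Double
    sumChunked [] = 0
    sumChunked xs = let (h, t) = splitAt cacheElems xs
                    in sumSimple h + sumChunked t

    -- The dispatch itself: one variant per condition.
    sumAdaptive :: [Double] -> Double
    sumAdaptive xs
      | length xs <= cacheElems = sumSimple xs
      | otherwise               = sumChunked xs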
Staging 15 years later
Multi-stage Programming with Explicit Annotations, or staging for short, is an approach to giving the programmer control over evaluation order. Staging was introduced in 1997 in a paper by Tim Sheard, my PhD advisor, and myself. Significant activity by numerous researchers followed. Many view staging as a way of improving program performance. While technically true, there is an important reason why this can be a misleading characterization. To address this problem, this talk briefly reviews staging, describes some important lessons learned about it, and suggests some promising directions for future work.
Performance Tuning for High Performance Computing Applications
In this talk, basic performance tuning techniques for high performance computing (HPC) applications will be presented. Many HPC applications work well when their data sets fit into the cache. However, when the problem size exceeds the cache size, the performance of these applications decreases dramatically. The key issue in the implementation of HPC applications is therefore minimizing the number of cache misses. Vectorization and parallelization are also important for achieving high performance on multi-core processors with short-vector SIMD instructions. Some performance results will also be presented.
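The central cache technique alluded to here is loop tiling: restructure the traversal so that each tile's working set fits in cache. A sketch of the traversal order in Haskell (the tile size b is an illustrative parameter):

    -- Visit an n x n index space in b x b tiles. Applying this order to,
    -- e.g., matrix transpose or multiplication keeps each tile's working
    -- set cache-resident instead of streaming the whole array per row.
    tiledIndices :: Int -> Int -> [(Int, Int)]
    tiledIndices n b =
      [ (i, j)
      | ii <- [0, b .. n - 1], jj <- [0, b .. n - 1]  -- tile origins
      , i  <- [ii .. min (ii + b) n - 1]              -- rows within a tile
      , j  <- [jj .. min (jj + b) n - 1] ]            -- columns within a tile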