Advances in Multiprocessor Programmability Using Coarse-Grain Execution
Assistant Professor Luis Ceze (Computer Science and Engineering, University of Washington)
Luis Ceze is an Assistant Professor in the Computer Science and Engineering Department at the University of Washington. His research focuses on computer architecture, compilers, programming models, and operating systems to improve the programmability and reliability of multiprocessor systems. He has co-authored over 30 papers in these areas and has had three papers selected as IEEE Micro Top Picks. He participated in the Blue Gene, Cyclops, and PERCS projects at IBM and is a recipient of several IBM awards, including an IBM PhD Fellowship. He obtained his PhD in Computer Science from UIUC in 2007 and has received awards for research and academic accomplishments, including the Ross Martin Award for Outstanding Research Achievement in the College of Engineering, the David Kuck Outstanding PhD Thesis Award, and an NSF CAREER Award. He recently co-founded a startup company where he is a part-time consultant.
National Institute of Informatics, 20F, Meeting Room
Transactional Memory (TM), Thread-Level Speculation (TLS), and Checkpointed multiprocessors are three popular architectural techniques based on the execution of multiple, cooperating speculative threads. In these environments, correctly maintaining data dependences across threads requires mechanisms for disambiguating addresses across threads, invalidating stale cache state, and making committed state visible. These mechanisms are both conceptually involved and hard to implement. In this talk, I will present Bulk, a novel approach to simplify these mechanisms. The idea is to hash-encode a thread's access information in a concise signature, and then support in hardware signature operations that efficiently process sets of addresses. Such operations implement the mechanisms described. Bulk operations are inexact but correct, and provide substantial conceptual and implementation simplicity. I will discuss an evaluation of Bulk in the context of TLS using SPECint2000 codes and TM using multithreaded Java workloads. Despite its simplicity, Bulk has competitive performance with more complex schemes.
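The signature idea can be sketched in software as a small Bloom-filter-like set. This is a hypothetical illustration: the width, hash functions, and names below are invented, and Bulk implements these operations in hardware.

```python
# Hypothetical software sketch of a Bulk-style signature: a Bloom filter
# over accessed addresses, so set operations become cheap bitwise ops.

SIG_BITS = 1024  # signature width (illustrative, not Bulk's actual size)

def _hashes(addr):
    # Two simple hash functions over the address (illustrative only).
    return [(addr * 2654435761) % SIG_BITS, (addr ^ (addr >> 7)) % SIG_BITS]

class Signature:
    def __init__(self):
        self.bits = 0  # the signature is just a fixed-width bit vector

    def insert(self, addr):
        # Hash-encode an accessed address into the signature.
        for h in _hashes(addr):
            self.bits |= 1 << h

    def may_contain(self, addr):
        # Membership test: false positives possible, false negatives not.
        return all((self.bits >> h) & 1 for h in _hashes(addr))

    def intersects(self, other):
        # Conflict check between two threads' access sets: a single
        # bitwise AND instead of walking per-address lists.
        return (self.bits & other.bits) != 0

# Thread A's write set checked against thread B's read set:
write_sig, read_sig = Signature(), Signature()
write_sig.insert(0x1000)
read_sig.insert(0x1000)
print(write_sig.intersects(read_sig))  # True: a (possible) conflict
```

Because hashing can alias, these operations can report conflicts that do not exist, but they never miss a real one; a spurious conflict costs only performance, never correctness, which is the sense in which Bulk operations are inexact but correct.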
While Sequential Consistency (SC) is the most intuitive memory consistency model and the one most programmers likely assume, current multiprocessors do not support it. Instead, they support more relaxed models that deliver high performance. SC implementations are considered either too slow or, when they can match the performance of relaxed models, too difficult to implement. In this talk, I will present Bulk Enforcement of SC (BulkSC), a novel way of providing SC that is simple to implement and offers performance comparable to Release Consistency (RC). The idea is to dynamically group sets of consecutive instructions into chunks that appear to execute atomically and in isolation. The hardware enforces SC at the coarse grain of chunks, which, to the program, appears to provide SC at the individual memory access level. BulkSC keeps the implementation simple by largely decoupling memory consistency enforcement from processor structures. Moreover, it delivers high performance by enabling full memory access reordering and overlapping within chunks and across chunks. I will describe a complete system architecture that supports BulkSC and show that it delivers performance comparable to RC.
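A rough software model of the chunk idea (the names and structure here are illustrative, not BulkSC's hardware mechanism): a chunk's memory operations are buffered, and the chunk commits atomically only if no write committed by another processor in the meantime overlaps its accesses; otherwise it is squashed and re-executed.

```python
# Illustrative model of chunked execution. BulkSC performs the conflict
# check in hardware with address signatures; plain sets are used here.

def run_chunk(ops, remote_write_sets):
    """ops: list of ('r'|'w', address) pairs forming one chunk.
    remote_write_sets: write sets of chunks that other processors
    committed while this chunk executed.
    Returns (committed, write_set)."""
    reads, writes = set(), set()
    for kind, addr in ops:
        (writes if kind == 'w' else reads).add(addr)
    for ws in remote_write_sets:
        if ws & (reads | writes):
            return False, set()  # conflict: squash and re-execute chunk
    return True, writes          # no conflict: commit chunk atomically

# Disjoint addresses: the chunk commits even though the hardware was
# free to reorder and overlap the accesses inside it.
ok, ws = run_chunk([('r', 0x10), ('w', 0x20)], [{0x30}])
print(ok, ws)  # True {32}
```

Because chunks commit atomically and conflicting chunks are squashed, the resulting execution is indistinguishable from SC even though accesses are fully reordered within each chunk.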
Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this talk I will make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. I will show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
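One simple way to make inter-thread communication deterministic, sketched below as an assumed simplification rather than any of the talk's specific DMP designs, is to let threads touch shared state only while holding a token that is passed in a fixed round-robin order. The interleaving of shared accesses then depends only on the program and its input, never on timing.

```python
# Sketch of deterministic inter-thread communication via a round-robin
# token (illustrative; real DMP proposals recover far more parallelism).

import threading

class DeterministicToken:
    def __init__(self, num_threads):
        self.turn = 0                    # whose deterministic turn it is
        self.n = num_threads
        self.cond = threading.Condition()

    def acquire(self, tid):
        with self.cond:
            while self.turn != tid:      # wait for our fixed slot
                self.cond.wait()

    def release(self, tid):
        with self.cond:
            self.turn = (self.turn + 1) % self.n   # hand off round-robin
            self.cond.notify_all()

token = DeterministicToken(2)
log = []

def worker(tid):
    token.acquire(tid)
    log.append(tid)    # shared-state access happens only inside the turn
    token.release(tid)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)  # always [0, 1], on every run
```

A real DMP system must recover the parallelism that this naive serialization gives up; the talk covers a range of such approaches and their trade-offs, but the repeatability property they provide is the one shown here.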
Writing shared-memory parallel programs is error-prone. Among the concurrency errors that programmers often face are atomicity violations, which are especially challenging. They happen when programmers make incorrect assumptions about atomicity and fail to enclose memory accesses that should occur atomically inside the same critical section. If these accesses happen to be interleaved with conflicting accesses from different threads, the program might behave incorrectly. Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce memory ordering at a coarse grain. This provides what we call implicit atomicity, as the atomic blocks are not derived from explicit program annotations. In this talk, I will make the fundamental observation that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. We then propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves. Atom-Aid is also able to report where atomicity violations might exist in the code, providing resilience and debuggability. I will present an evaluation of Atom-Aid using buggy code from applications including Apache, MySQL, and XMMS, showing that Atom-Aid virtually eliminates the manifestation of atomicity violations.
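A minimal illustration of such a bug (hypothetical code, not taken from the evaluated applications): the check and the use below are assumed by the programmer to be atomic, but nothing encloses them together, leaving a window where a conflicting write from another thread can interleave.

```python
# Hypothetical atomicity violation: two accesses that should be in one
# critical section are separated, so a conflicting write can land in
# the window between them.

buf = ["x"]  # shared state

def unsafe_first(interleave=None):
    nonempty = len(buf) > 0   # access 1: the check
    if interleave:
        interleave()          # the window: a conflicting thread runs here
    if nonempty:
        return buf[0]         # access 2: the use, assumed atomic with 1
    return None

print(unsafe_first())                  # 'x': the common, benign schedule
try:
    unsafe_first(lambda: buf.clear())  # conflicting write in the window
except IndexError:
    print("atomicity violation manifested")
```

If both accesses happen to fall inside one implicit atomic chunk, the window disappears; placing chunk boundaries so that such access pairs land in the same chunk is the effect Atom-Aid aims to produce deliberately rather than by chance.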