Abstract Memory-mapped transactions combine the advantages of both memory mapping and transactions to provide a programming interface for concurrently accessing data on disk without explicit I/O or locking operations. This interface enables a programmer to design a complex serial program that accesses only main memory, and with little to no modification, convert the program into correct code with multiple processes that can simultaneously access disk.
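The memory-mapping half of this interface can be illustrated with the standard `mmap` facility: a file is mapped into the address space and then read and written like an ordinary byte array, with no explicit I/O calls. The sketch below (plain Python, with an assumed file name `demo.dat`) shows only the memory-mapping idea; the transactional concurrency control described in the abstract is not modeled here.

```python
import mmap
import os

# Create a small backing file and map it into memory.  This is a
# sketch of the memory-mapping half only: reads and writes go through
# memory, with no explicit read()/write() calls and no transactions.
path = "demo.dat"
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as mm:
        mm[0:5] = b"hello"      # write to disk through memory
        first = bytes(mm[0:5])  # read back the same region

os.remove(path)
print(first)  # b'hello'
```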
An Implementation of the Karp-Zhang Parallel Branch-and-Bound Algorithm. Tiankai Liu, under the direction of Prof. Charles E. Leiserson, Massachusetts Institute of Technology, Research Science Institute, July 29, 2003. Abstract This paper studies an implementation of the Karp-Zhang parallel branch-and-bound algorithm on a shared-memory machine. By employing it to solve a solitaire card puzzle, empirical data on the speedup of the algorithm are obtained.
Ф Ь м ОКМ is a multimedia player designed especially for the display of recorded classroom lectures. It plays multimedia consisting of synchronized audio, video, and "whiteboard"-style slides. In addition to the controls commonly found in multimedia players, it features controls designed especially for lecture multimedia, such as customizable speed, variable-speed play with pitch normalization, and browsing a timeline of slides.
Arguably, one of the biggest deterrents for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is enforcing atomicity so that the partial results of one task do not inadvertently corrupt another task.
Abstract The difficulties in designing systolic processors can be reduced by applying the architectural transformations of code motion, retiming, slowdown, coalescing, parallel/serial compromises and partitioning to a more easily designed combinational or semisystolic form of the processor. In this paper, the use of these transformations and the attendant tradeoffs in the design of architectures for adaptive filtering based on the Gram-Schmidt algorithm are considered.
Abstract Existing concurrency platforms for dynamic multithreading do not provide repeatable parallel random-number generators. This paper proposes that a mechanism called pedigrees be built into the runtime system to enable efficient deterministic parallel random-number generation. Experiments with the open-source MIT Cilk runtime system show that the overhead for maintaining pedigrees is negligible.
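The pedigree idea can be sketched as follows: each strand of the computation is named by the sequence of spawn ranks on the path from the root, and the random number for a strand is a deterministic function of that name, so the same strand draws the same value on every run regardless of how the scheduler interleaves work. The encoding and hash below are illustrative choices, not the paper's actual compression scheme.

```python
import hashlib

def pedigree_rand(pedigree):
    """Deterministically map a pedigree (tuple of spawn ranks on the
    path from the root of the computation) to a value in [0, 1).
    Illustrative only: the real mechanism maintains and compresses
    pedigrees inside the runtime system."""
    data = b",".join(str(rank).encode() for rank in pedigree)
    h = hashlib.sha256(data).digest()
    return int.from_bytes(h[:8], "big") / 2**64

# The same strand gets the same number on every run, independent of
# scheduling; distinct strands get (almost surely) distinct numbers.
a = pedigree_rand((0, 2, 1))
b = pedigree_rand((0, 2, 1))
c = pedigree_rand((0, 2, 2))
assert a == b and a != c
```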
JCilk is a Java-based multithreaded language for parallel programming that extends the semantics of Java by introducing "Cilk-like" [1, 2] linguistic constructs for parallel control. The original Cilk language provides a dynamic multithreading model that supports call-return semantics in a C language context. The Cilk system also includes a provably good scheduler that guarantees programs can take full advantage of the resources available at runtime.
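The call-return semantics of dynamic multithreading can be sketched with futures: a spawned call may run in parallel with the rest of the function, and waiting on its result plays the role of a sync. This is a rough library-level analogue, not JCilk or Cilk syntax, and the oversized worker pool is an assumption made to keep this toy free of blocking-chain deadlocks.

```python
from concurrent.futures import ThreadPoolExecutor

# Rough analogue of Cilk-style spawn/sync using futures (illustrative;
# Cilk and JCilk express this with language constructs, not a library).
# max_workers is deliberately larger than the number of spawns so that
# blocked parents can never exhaust the pool in this toy example.
pool = ThreadPoolExecutor(max_workers=128)

def fib(n):
    if n < 2:
        return n
    x = pool.submit(fib, n - 1)   # "spawn": child may run in parallel
    y = fib(n - 2)                # continuation runs meanwhile
    return x.result() + y         # "sync": wait for the spawned child

print(fib(10))  # 55
```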
Abstract A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write.
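For concreteness, here is the naive serial form of such a computation: a 1-dimensional three-point averaging stencil applied repeatedly. Trapezoidal decompositions reorganize exactly this doubly nested loop (over time steps and grid points) for parallelism and cache efficiency; this sketch shows only the straightforward version.

```python
def stencil_step(u):
    """One update of a 1-D three-point stencil (Jacobi-style averaging)
    with fixed boundary values: each interior point becomes a function
    of itself and its two near neighbors."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v

u = [0.0] * 5 + [1.0] + [0.0] * 5   # a spike that diffuses outward
for _ in range(10):                  # the time loop of the computation
    u = stencil_step(u)
```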
I have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. My PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms.
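The key substitution can be sketched serially: process the frontier level by level as an unordered collection rather than a strict FIFO queue, since BFS distances depend only on the level at which a vertex is first reached, not on intra-level visit order. The actual bag is a specialized divide-and-conquer structure that splits for parallel traversal; a plain Python set stands in for it here.

```python
def level_bfs(adj, source):
    """Level-synchronous BFS.  The frontier is an unordered collection
    (a set stands in for PBFS's "bag"); correctness needs only the
    level structure, not FIFO order within a level."""
    dist = {source: 0}
    frontier = {source}
    d = 0
    while frontier:
        d += 1
        nxt = set()
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = d
                    nxt.add(v)
        frontier = nxt
    return dist

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
assert level_bfs(adj, 0) == {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```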
The purpose of this programming assignment is to develop an understanding of asynchronous algorithms involving active messages. You will improve existing code that draws pictures of the so-called Mandelbrot set. Your Mandelbrot program will be written in C, and it will run on the xolas cluster. The assignment explores issues such as deadlock prevention, request-reply protocols, and termination protocols.
In this research, we address the problem of adaptive scheduling and resource allocation in the domain of dynamic multithreading. Most existing parallel programming systems are nonadaptive, where each job is assigned a fixed number of processors. This policy places the burden of estimating the parallelism of the job on the programmer. In addition, nonadaptive scheduling may lead to a poor use of available resources.
In this research we address the problem of scheduling many adaptively parallel jobs on a multiprocessor system [4, 5, 6]. An adaptively parallel job is a job that can change its parallelism in the course of its execution. Today, most multiprocessor systems use static allocation, where a fixed number of processors is allocated to the job for its lifetime. This policy places the burden of estimating the parallelism of the job on the programmer.
Abstract The term "macro-level scheduling" refers to finding and recruiting idle workstations and allocating them to various adaptively parallel applications. In this thesis, I have designed and implemented a macro-level scheduler for the Cilk Network of Workstations environment. Cilk-NOW provides the "micro-level scheduling" needed to allow programs to be executed adaptively in parallel on an unreliable network of workstations. This macro-level scheduler is designed to be hassle-free and easy to use and customize.
In this model we are adding numbers x_j and excluding a neighborhood of x_i. In the FMM, the x_j become representations of functions, which are accurate only at some distance from point i. This core idea, though perhaps obvious, was buried for many years. It took a trip to Japan, years of classroom presentations (Edelman, Leiserson), and a recent conversation over lunch at MIT (Demaine, Demaine, Edelman, Persson) before we could articulate the essence of the FMM.
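The stated core can be written as a single identity: the far-field sum at point i equals the shared total minus the near-field terms, sum_{j not near i} x_j = (sum_j x_j) - sum_{j near i} x_j, so one precomputed total serves every point. In the FMM proper, the shared total is replaced by compressed far-field function representations; the toy sketch below uses plain numbers and a hypothetical `near` map from each i to its excluded indices.

```python
def far_sums(x, near):
    """For each point i, sum the x_j with j outside i's neighborhood,
    reusing one shared total:
        sum_{j not near i} x_j = total - sum_{j near i} x_j.
    `near[i]` lists the excluded indices (including i itself).  In the
    FMM, the shared total becomes a hierarchy of compressed far-field
    representations that are accurate only away from point i."""
    total = sum(x)
    return [total - sum(x[j] for j in near[i]) for i in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
near = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
assert far_sums(x, near) == [7.0, 4.0, 1.0, 3.0]
```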
This document describes Cilk 1.2 (Version 1), a C language extension and its supporting runtime system intended for developing continuation-passing-style multithreaded programs on the CM-5. Cilk grew out of efforts in implementing a simple scheduling and execution model on top of the CM-5's active message layer, and in adapting it to the needs of real-life application programs.
Abstract The ftIO system provides portable and fault-tolerant file I/O by enhancing the functionality of the ANSI C file system without changing its application programmer interface and without depending on system-specific implementations of the standard file operations. The ftIO system is an extension of the porch compiler and its runtime system. The porch compiler automatically generates code to save the internal state of a program in a portable checkpoint.
We provide new competitive upper bounds on the performance of the memoryless, randomized caching algorithm RAND. Our bounds are expressed in terms of the inherent hit rate α of the sequence of memory references, which is the highest possible hit rate that any algorithm can achieve on the sequence for a cache of a given size. Our results show that RAND is (1 − αe^(−1/α))/(1 − α)-competitive on any reference sequence with inherent hit rate α.
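The stated ratio is easy to evaluate numerically: it tends to 1 for easy reference sequences (α → 0) and grows without bound as α → 1, where even small miss-rate differences are costly. A minimal sketch, assuming the formula reads (1 − αe^(−1/α))/(1 − α) for 0 < α < 1:

```python
import math

def rand_bound(alpha):
    """The competitive ratio stated for RAND,
    (1 - alpha * e^(-1/alpha)) / (1 - alpha),
    as a function of the inherent hit rate 0 < alpha < 1."""
    return (1.0 - alpha * math.exp(-1.0 / alpha)) / (1.0 - alpha)

# The bound approaches 1 as alpha -> 0 and diverges as alpha -> 1.
for a in (0.1, 0.5, 0.9):
    print(f"alpha={a}: ratio <= {rand_bound(a):.3f}")
```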
An irreversible shift towards multicore x86 processors is underway. Building multicore processors delivers on the promise of Moore's Law, but it creates an enormous problem for developers. Multicore processors are parallel computers, and parallel computers are notoriously difficult to program.
Abstract Pochoir is a compiler for a domain-specific language embedded in C++ that produces excellent code from a simple specification of a desired stencil computation. Pochoir allows a wide variety of boundary conditions to be specified, and it automatically parallelizes and optimizes cache performance. Benchmarks of Pochoir-generated code demonstrate a performance advantage of 2–10 times over standard parallel loop implementations.
Open-nested transactions [2–5] have been proposed as a loophole for transactional memory (TM) to increase concurrency on highly contended resources in transactional programs. Programs that use open nesting can be difficult to reason about because open nesting breaks serializability at the level of memory semantics. Evidence suggests that an unconstrained use of open nesting cannot be encapsulated, i.e., that programmers may need to be aware of whether subroutines contain open-nested transactions.