

CellPilot: A Seamless Communication Solution for Hybrid Cell Clusters

2011 40th International Conference on Parallel Processing Workshops

N. Girard, W.B. Gardner, J. Carter, G. Grewal
School of Computer Science, University of Guelph, ON, Canada
{ngirard,gardnerw,jcarter,ggrewal}@uoguelph.ca

Abstract—The CellPilot library provides a comprehensive interprocess communication solution for parallel programming in C on clusters comprised of Cell BE and other computers. It extends the process/channel approach of the existing Pilot library to cover processes running on Cell PPEs and SPEs. The same simple API is used to read and write messages on channels defined between pairs of processes regardless of location, while hiding communication details from the user. CellPilot uses MPI for inter-node communication, and the Cell SDK within a Cell node.

Keywords—Cell Broadband Engine; parallel programming; MPI; high-performance computing

I. INTRODUCTION

A high-performance cluster constructed of Cell Broadband Engine (Cell BE) [1] nodes, perhaps with other heterogeneous nodes, presents formidable difficulties for a programmer of ordinary skill and knowledge to utilize. While the popular message-passing library MPI [2] can handle inter-node communication on a hybrid cluster, MPI is not integrated with the communication libraries needed for intra-Cell use. The hardware resources include the relatively slow Cell Power Processor Elements (PPEs), the fast Synergistic Processor Elements (SPEs), and fast non-Cell cores, but splicing together a solution from MPI and the Cell’s own libraries that would facilitate deploying a cluster-wide application is daunting. While it is true that not every HPC problem stands to benefit from the strengths of the Cell’s unique architecture, if an available Cell cluster is heavily underutilized while conventional clusters are crowded with users—precisely the situation in the HPC consortium associated with our university—it may be that the above challenges are at least partly to blame. Therefore, lowering the barriers for a would-be Cell cluster programmer to “get into the game” is a worthy goal.

In this paper, we present a work in progress called CellPilot, which gives HPC programmers an alternate communication model that applies seamlessly to all processor resources in a cluster of Cells and non-Cell nodes. CellPilot was first introduced in concept at HPCS 2010 (High Performance Computing Symposium) [3]. It is based on extending the Pilot library [4], originally designed as an easier method for novice scientific programmers to write parallel cluster applications in C and Fortran, onto the Cell, using Pilot’s same process/channel abstractions and its same economical, stdio-inspired API with its easy-to-master fprintf/fscanf metaphor.

With CellPilot, programmers design software in terms of processes that can be located on any PPEs, SPEs, or non-Cell nodes, and communication channels bound to process pairs. Programs are coded in terms of reading and writing on those channels, whereupon CellPilot transparently applies whichever communication mechanisms are required to transport the message, regardless of its endpoints. This gives the programmer a convenient unified view of all processor resources on the cluster, and a way to handle interprocess communication while avoiding low-level operations and multiple library APIs.

The Pilot library itself is a thin, transparent layer on top of standard MPI; that is, it uses MPI as the transport mechanism.
CellPilot is an extension of Pilot that provides communication to and from SPE processes, with low-level communication carried out by means of IBM’s Cell Software Development Kit (SDK) functions.

This paper first presents the necessary background (Section II) on the Cell BE, its IBM-supplied communication libraries, the Pilot approach to cluster programming, and related work on Cell BE programming tools. Then the programming model provided by CellPilot is described in Section III, followed by the underlying implementation (Section IV) based on Pilot, MPI, and the Cell SDK. CellPilot is being used for several case studies through which performance data (Section V) can be obtained. Finally, the present status and availability of CellPilot is reported (Section VI), and the paper concludes.

II. BACKGROUND

A. Cell BE architecture

The Cell processor is made up of 9 RISC cores: one Power Processor Element (PPE) runs the operating system, typically Linux, and is the coordinator of the 8 Synergistic Processing Elements (SPEs) used for accelerated computation with their SIMD instructions, each having a small local storage space of 256KB. (The terms PPE and SPE seem to be used almost interchangeably with PPU (Power Processor Unit) and SPU (Synergistic Processing Unit), respectively. The reality is a bit more complicated, with the processor unit hardware being contained within the processing elements. For simplicity, this paper always refers to PPE and SPE.) These 9 cores are interconnected via a specialized high-bandwidth circular data bus called the Element Interconnect Bus (EIB). Two Cell processors can be connected via their I/O elements to make a Cell blade. Typical high-performance Cell clusters are constructed using Cell blades and a handful of quad-core processors such as the Xeon, thus yielding a heterogeneous or hybrid cluster.

A key reason for the Cell’s high performance is the fact that each SPE can directly address only its own independent memory, thereby obviating the need for expensive and latency-producing cache coherency hardware. Each SPE’s local store is only 256KB for the application code and data. Programmers must pay special attention not to exceed this limit, and may need to divide up their application code accordingly, for which an overlay capability is available. A by-product of this design is that programmers are compelled to move data explicitly between main memory and the SPEs’ memories. The architecture provides a number of complex mechanisms to accomplish this, including DMA transfers and mailboxes, and some have stringent address alignment requirements. For optimal DMA performance, the data to be transferred should be aligned to a quad-word address.

SPE programs are launched when a PPE program creates a “context” associated with the desired executable (which was earlier embedded by a special linker into the PPE executable in the guise of initialized static data) and loads it onto an SPE under the control of a PPE POSIX thread that waits for the SPE program’s asynchronous completion. In this way, all SPEs can be kept busy computing in parallel.

Thus, learning to program an individual Cell node successfully can already be difficult enough without introducing the added complications of off-node communication. I/O between two different Cell nodes must be done via the PPE, thus for SPEs of different nodes to intercommunicate requires three hops involving two PPEs.
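To make the launch sequence concrete, the following is a minimal sketch of that pattern using the SDK’s libspe2 runtime library (introduced in the next subsection): the PPE creates a context, loads an embedded SPE program into it, and runs it from a POSIX thread that blocks until the SPE program stops. The program handle name spe_work is hypothetical and error handling is minimal; this illustrates the general SDK pattern, not CellPilot’s internals.

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <libspe2.h>

    /* Produced by the SDK's embedspu step; the name "spe_work" is hypothetical. */
    extern spe_program_handle_t spe_work;

    /* Thread body: run one SPE program to completion. */
    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        /* Blocks this PPE thread until the SPE program exits. */
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
            perror("spe_context_run");
        return NULL;
    }

    int main(void)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (ctx == NULL) { perror("spe_context_create"); exit(1); }

        if (spe_program_load(ctx, &spe_work) != 0) {
            perror("spe_program_load"); exit(1);
        }

        pthread_t thr;
        pthread_create(&thr, NULL, run_spe, ctx);  /* SPE now runs asynchronously */

        /* ... the PPE could create and run more contexts here, one per SPE ... */

        pthread_join(thr, NULL);                   /* wait for SPE completion */
        spe_context_destroy(ctx);
        return 0;
    }

Building such a program requires embedding the SPE executable into the PPE binary (e.g., with the SDK’s embedspu tool) and linking against -lspe2 and -lpthread.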
IBM provides libraries, described next, that at least allow a programmer to avoid coding assembly language instructions for data transfers.

B. Cell communication libraries

The Cell Software Development Kit (SDK) offers an API for creating, building, simulating and testing Cell applications. The SDK includes libraries in support of computation—vector and linear mathematics computation, Fast Fourier Transforms, access to the SPEs’ SIMD instructions at the C level—and data communication. One important library is the SPE Runtime Management Library (libspe2), which provides a low-level API for SPE management. Another is the Accelerated Library Framework (ALF), which offers a programming environment for data- and task-parallel applications.

The Data Communication and Synchronization library (DaCS), included in the Cell SDK, provides services to ease the development of applications on the Cell BE in terms of a hierarchical topology of processing elements. DaCS includes resource and process management, and data communication services. In the process hierarchy, the PPE is considered the Host Element (HE) and its associated SPEs are Accelerator Elements (AEs). There is limited support for collective operations, scatter and gather, between the PPE and a list of SPEs. DaCS does have an extension, DaCS for Hybrid (DaCSH), that allows for off-node communication in a cluster. It effectively adds another layer to the process hierarchy, as illustrated in Figure 1. One non-Cell (x86-64) node is the HE for the cluster, and all Cell nodes (PPEs) are its AEs. These nodes are shown circled in the diagram. Each AE that resides on a PPE is also the HE of its own Cell, with its SPEs under its control as AEs. As will be seen below, this is somewhat similar to what CellPilot achieves using MPI.

Figure 1. Process hierarchy of DaCSH

However, there are limitations to the DaCS library. On the local level, direct communication between SPEs is not supported due to the strongly hierarchical model of DaCS, and DaCSH does not have the flexibility to encompass a cluster of arbitrary node architectures.

C. The Pilot approach

Pilot [4, 5] is a library for high-performance computing built on top of MPI, adding a layer of abstraction that follows the Communicating Sequential Processes (CSP) [6] process-channel paradigm. Pilot’s API has a very small set of functions. Its communication calls are modelled on C’s stdio syntax for fprintf and fscanf, specifying first the channel, then the data format and length, and finally the values or variables. For example, the PI_Write() call below sends an array of 1000 floating point numbers on the given channel:

    PI_CHANNEL *workerdata;
    float data[1000];
    ...
    PI_Write(workerdata, "%1000f", data);

It would be matched with a similar call to PI_Read() in the receiving process. The format is simply a convenient way to describe the data; it does not imply that the data is converted to text for transmission. And it need not be a string literal; it can be supplied by a variable.

Pilot applications execute in two distinct phases. The first is the configuration phase, where the static application architecture, comprising processes and channels, is defined by calling PI_CreateProcess() and PI_CreateChannel(). The program can easily learn how many total Pilot processes can be created (= the number of MPI processes specified by mpirun), which is necessary for writing scalable applications that utilize every available processor.
Defining a process means pointing to a function for it to run during the execution phase. The same function body can be associated with multiple processes, and an index parameter can be passed so it can identify its own instance, very much in the same style as POSIX pthread_create(). This phase is concurrently executed by every MPI process in the cluster, resulting in the construction of equivalent internal tables on the various processors, regardless of their respective word length, data alignment, and endian properties.

The execution phase commences as each process invokes its associated function, except for MPI rank #0, also known as PI_MAIN, which has no additional associated function and simply continues executing statements in the main() function. During the execution phase, processes may write to and read from channels, which results in MPI messages being sent and received “under the hood.” The program ends when all processes return from their respective functions, and PI_MAIN executes PI_StopMain(). They all synchronize on an internal barrier before exiting.

Pilot has a special way of providing access to a selected subset of MPI’s collective operations. During the configuration phase, a set of channels having a common endpoint can be designated as a bundle to be used for a specific purpose by calling PI_CreateBundle(). Bundle operations supported as of V1.2 are: broadcast, gather, and select. The first two have the usual MPI sense, while the select operation blocks until some channel in the bundle has data ready to read (so that a read on the channel would not block). The nomenclature is meant to suggest a Unix/POSIX “select” operation on a set of file descriptors. Non-blocking operations are also available: checking whether a channel or a bundle has data to read.

The key difference between MPI’s collective operations and Pilot’s is that MPI follows a pure SPMD convention. For example, if one process is broadcasting to 50 others, all 51 must execute MPI_Bcast(), which is arguably counterintuitive given that 50 are actually receiving. With Pilot, in contrast, only the broadcasting process calls PI_Broadcast() on the bundle; the 50 receivers each call PI_Read() on their respective channels. Thus, Pilot follows an MPMD convention akin to pthreads.

Launching a Pilot application on a heterogeneous cluster is straightforward: keeping in mind that a Pilot application is just an MPI job, the programmer constructs an mpirun command (or whatever their site requires) that designates which nodes will run which executables. The first node will take on the identity of MPI rank #0 or PI_MAIN, then the rest of the nodes will become the other Pilot processes. MPI will take care of any conversions required between datatype lengths, endianness, and character codes.

Pilot comes equipped with an integrated deadlock detection tool. This feature consumes one MPI process and is enabled simply by coding the option “-pisvc=d” on the command line (i.e., mpirun...-pisvc=d). Errors such as circular wait will cause the program to abort with a diagnostic message identifying the deadlocked processes.

Pilot is available for downloading from its website [7], which also contains documentation and tutorials. Pilot is copyright by the University of Guelph and is not open source, but it can be used for free by anyone without any licensing formalities.
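Putting the two phases together, here is a minimal sketch of a complete Pilot program in the style of the API calls just described (and of Figure 3 later in this paper). The process name worker, the channel name chan, and the payload are illustrative choices, not code from the Pilot distribution; the use of PI_MAIN as a channel endpoint follows its description above as the rank #0 process.

    #include <stdio.h>
    #include <pilot.h>

    PI_CHANNEL *chan;   /* channel variables are typically global */

    /* Worker process body; the (int, void *) signature matches PI_CreateProcess usage. */
    int worker(int idx, void *p)
    {
        float data[1000];
        PI_Read(chan, "%1000f", data);          /* blocks until PI_MAIN writes */
        printf("worker %d received %f..%f\n", idx, data[0], data[999]);
        return 0;
    }

    int main(int argc, char *argv[])
    {
        /* Configuration phase: executed identically by every MPI process. */
        int N = PI_Configure(&argc, &argv);     /* N = total Pilot processes available */

        PI_PROCESS *w = PI_CreateProcess(worker, 0, NULL);
        chan = PI_CreateChannel(PI_MAIN, w);

        /* Execution phase: worker() starts on its own rank; rank 0 continues here. */
        PI_StartAll();

        printf("%d Pilot processes available\n", N);

        float data[1000];
        for (int i = 0; i < 1000; i++)
            data[i] = (float)i;
        PI_Write(chan, "%1000f", data);

        PI_StopMain(0);
        return 0;
    }

Launched as an ordinary MPI job (e.g., mpirun -np 2), rank #0 becomes PI_MAIN and the remaining rank runs worker().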
Benefits of the Pilot approach to cluster programming include the elimination of categories of common parallel programming errors, such as one process mistakenly sending a message to another process that is unprepared to receive it, or coding errors with MPI rank or tag numbers. After the programmer configures an application’s process/channel architecture, Pilot enforces that architecture at run time, prohibiting communication from taking place except over predefined channels, and reporting API misuses by source file and line number. With the optional activation of built-in deadlock detection, Pilot can also diagnose conditions that would cause mysterious program hangs using MPI alone.

D. Related work

Several researchers have created languages and libraries designed to ease programming of the Cell’s complex architecture. However, only one of these is directly applicable to a cluster of Cells; that is, an HPC programmer could choose a technique to apply on a per-Cell level, and then arrange for inter-node communication via some other mechanism such as MPI.

First, partial versions of MPI have been programmed for use on a Cell. The core functions of MPI-1 have been implemented using a buffered approach for small messages [8] and a synchronous approach for large messages [9]. The Cell Messaging Layer (CML) is an implementation of a small subset of MPI that is usable on a single Cell processor or on Cell clusters [10]. CML assigns MPI ranks to all available SPEs, but not to PPEs, which are reserved for use by the library to carry out inter-Cell communication by means of conventional MPI. Available operations are MPI_Send and MPI_Recv, and the collective operations MPI_Bcast, MPI_Reduce and MPI_Allreduce, which are designed hierarchically. The limited implementation of these libraries made them infeasible candidates for CellPilot to build upon, since Pilot itself uses more of MPI. A key difference is that with CellPilot PPEs can host processes just like any non-Cell node.

MPI Microtask [11] allows microtasks, essentially virtual SPEs written by the programmer, to be created dynamically at run time and assigned to a function. An MPI-style interface is used for communication. The MPI Microtask preprocessor decomposes the microtasks into basic tasks consisting of units of computation bounded by communication events, and groups together the basic tasks with strong dependencies. The preprocessor also pre-computes runtime parameters such as message buffer addresses, which helps reduce the overhead in the runtime system. CellPilot does not require a preprocessor, but simply provides a higher-level library interface for communication.

The Charm++ Offload API [12] allows Charm++ work requests, which are blocks of work that do not have any data dependencies with any other executing work requests, to be off-loaded to the SPEs. The Offload API can be run independently from Charm++ as it is simply an API that allows C/C++ programs to be executed on the SPEs. CellPilot uses a different approach to spawning blocks of work: they are packaged as C functions that become executing processes in the program’s configuration phase, with CellPilot automatically handling the off-loading of processes to SPEs.

The MultiCore Framework from Mercury [13] consists of a manager and multiple workers that communicate via “tiles” over channels. A distribution object, containing the entire data set and parameters of a portion of the algorithm, is used to create tile channels.
Tiles are buffers that contain a small part of the data and are sent to the workers via the channels. Workers connect to a tile channel, receive a tile, perform computation on the data within the tile, replace the data with the results, and return the tile to the manager via the channel. CellPilot also depends on the channel abstraction, but does not provide a data organizational construct like tiles.

Cell Superscalar (CellSs) [14, 15] has the objective of providing a simple and flexible programming model by attempting automatic parallelism. The programmer needs to annotate functions that can be run independently in parallel. A source-to-source compiler converts the annotated code into a Cell-compatible program by exploiting function parallelism. A runtime library builds a dependency graph that exposes data dependencies among the annotated functions, does the task scheduling, and handles the data transfers between the processors. CellPilot does not attempt automatic parallelization; the programmer is entirely responsible for the application's parallelism.

The Cell processor was designed to accommodate different programming paradigms. The streaming model, useful for gaming and high-definition television, is supported by the Multicore Streaming Layer (MSL) [16]. MSL is a general runtime framework that employs a dataflow graph—where each node is an actor that performs some computation, and edges express data dependencies between input and output streams associated with those actors—for automatically determining pipeline or data parallelism within streaming applications. Since all data communication within CellPilot is blocking, it is not a good candidate for the stream programming model.

Coconut (COde CONstructing User Tool) [17] parallelizes a program written in Haskell for the Cell processor by using a graph to represent the program's data and control flow. The tool allows the programmer to manipulate the graph in order to create a high-performance schedule of the work to be done on the SPEs. Coconut uses formal methodologies to ensure that the parallel version of a program is equivalent to its serial version and to verify that the schedule created by the programmer is valid, i.e., independent of any eventual execution order. CellPilot programmers do not require any knowledge of formal methods; instead, CellPilot was designed to inherently embody the formal principles of CSP in its process-channel model.

Cellflow [18] is a programming toolkit offering both off-line and on-line facilities. Among the off-line facilities are a task allocator and scheduler that uses a Constraint Programming approach for optimizing allocation and scheduling using a task graph representing the application and hardware resources. A customizable application template allows programmers to identify task dependencies in their application and easily create a task graph. The on-line support includes a software library and high-level APIs which manage the communication and synchronization of the tasks using data queues, which are stored in a task table formed at start time; counters, which keep track of free slots in the data queues; and a series of semaphores, which are used for synchronization by signalling data transfer completion. In CellPilot, the SPEs are manually scheduled by the programmer during the execution phase, and the CellPilot library does not attempt to optimize SPE scheduling.
BlockLib [19] is a Cell abstraction built atop the NestStep platform, a bulk-synchronous parallel (BSP) address space language supporting nested parallelism, which makes the user code more portable between NestStep-enabled systems. BlockLib provides skeleton functions for basic computation patterns and automates much of the memory management and parallelization. The library uses a small macro language to allow users to access SIMD optimizations without having to do hand optimization. Synchronization and inter-SPE communication are available outside of the BSP model. Synchronization is achieved using signals, and all transfers are double buffered and 128-byte aligned where possible. Synchronization is implicit within CellPilot as all communication is blocking, and there is no explicit way to synchronize processes such as a barrier operation.

StarPU [20] is a high-level framework that abstracts heterogeneous processors such as the Cell BE and GPGPU (general-purpose computing on graphics processing units), allowing applications to be written without needing to worry about the underlying hardware-specific details. CellPilot extends the Pilot library by adding SPE functionality, and is therefore only applicable to Cell systems. The Pilot library can still be used, with the exclusion of the SPE features, on MPI-enabled clusters.

In terms of programming a single Cell computer, it can be said that CellPilot is not as ambitious as some of the above tools. The initiative for identifying code to run on SPEs is left in the programmer’s hands, and CellPilot’s basic contributions are easy methods to launch SPE processes and to handle interprocess communication. However, all of the above techniques, along with their greater ambition, also have significant learning curves, whereas with CellPilot, the programmer is coding in C as usual. Then, in the cluster context, CellPilot’s special contribution of cluster-wide communication and coordination applies.

III. CELLPILOT PROGRAMMER’S MODEL

Simply put, the design objective of CellPilot was to allow the Cell’s SPEs to participate as “equal citizens” as sites for processes and for channel-based communication in the Pilot programming model. If a programmer has already learned how to use Pilot on a conventional cluster, learning a couple more API functions for the SPE is a small matter.

During the configuration phase, which is executing on the various nodes’ PPEs, each function that is intended to be run as an SPE process is defined with a PI_CreateSPE() call. The difference from the usual PI_CreateProcess() is that SPE processes are not automatically launched in the execution phase, as are all other Pilot processes. Rather, they must be explicitly launched by the process “in charge of” that Cell node (i.e., the PPE process) during its own execution phase. This is completely in keeping with the idea that SPEs have limited memory and may need to be loaded and reloaded with codes to perform small computations that are part of the overall application. It establishes a kind of process hierarchy, in that each set of SPE processes is controlled by its local PPE-based “parent” process. This is similar to the hierarchy of DaCSH, except that channels can directly route communication between processes at any level, as shown in Figure 2. The various “types” of channels, which are transparent to the programmer, are explained in Section IV-B.

In this way, all Pilot processes are globally defined throughout the application and are candidates for binding to channels. Channel variables are typically global to the program, so that they can be initialized in main() and referenced in process functions. SPE processes can use the __ea “effective address” attribute to refer to globals in main memory, and the compiler will generate the necessary linkage code to resolve the address at run time. Thus, SPE processes can refer to channels and arrays of channels symbolically. Furthermore, only their address is utilized on the SPE, not the contents of the PI_CHANNEL data structure in main memory, which would require much more overhead to extract.

IV. CELLPILOT IMPLEMENTATION

The two key implementation issues concern the control of SPE-resident processes and the carrying out of channel communication across every combination of local vs. remote PPE or SPE, and off-node processes. These are described in the sections below.

Figure 2. CellPilot processes and channels

A. SPE processes

Just like regular Pilot processes, which are equivalent to MPI ranks, SPE processes are defined during the program’s configuration phase by calling PI_CreateSPE(). Instead of pointing to a C function to run (on the PPE), SPE processes are associated with an spe_program_handle_t (SDK typedef), which is an external symbol associated by the linker with some SPE object code that is embedded into the PPE executable file in the guise of static data. (This typedef is referred to by the macro PI_SPE_FUNC so that CellPilot configuration code can also compile on non-Cell nodes.) Such processes remain dormant until PI_RunSPE() is called by the “parent” PPE Pilot process. At that time, CellPilot takes care of spawning a pthread to load the code into an SPE by means of SDK functions, and waiting (in the background) for its completion.

Prior to that, the programmer will have coded and compiled the SPE processes. Two cellpilot.h macros, PI_SPE_PROCESS(int,void*) and PI_SPE_END (see example in Figure 4), are used to bracket an SPE process function. The first one hides the code that transfers the arguments provided by PI_RunSPE(...,int,void*). Such arguments are especially useful when starting multiple instances of the same process function in data parallel programming, e.g., to give each instance a different index number or other parameter. The second macro hides the code that terminates the SPE program.

B. CellPilot channels

Of the existing SDK libraries, DaCS is closest to what CellPilot wants to offer on the local node level, and can be a partial replacement for MPI on the cluster level. At least in part, CellPilot could have used DaCS in a similar manner to how Pilot uses MPI. However, investigation determined that DaCS (a) was not a comprehensive solution—in particular, it did not address SPE-to-SPE communication nor support collective operations across the cluster; and (b) linking it into every SPE program would simply take up precious space from the 256K limit. So we decided to use only the basic functions in libspe2 to implement CellPilot, while retaining Pilot’s relationship with MPI, and thus avoid DaCS altogether. Similarly, the programming model supported by ALF was judged to be too restrictive to be compatible with the Pilot paradigm, so ALF was not used in CellPilot either.

Based on the possible locations of channel endpoints, Table I lists all types of channels that must be catered for.

TABLE I. CELLPILOT CHANNEL TYPES

Type   Process Locations
1      PPE or non-Cell -- Remote PPE or non-Cell
2      PPE -- Local SPE
3      PPE or non-Cell -- Remote SPE
4      SPE -- Local SPE
5      SPE -- Remote SPE

Type 1 transfers are between regular Pilot processes, and are handled by Pilot in the usual way, via MPI send/receive. The other types require assistance from a PPE process. In order to avoid interfering with the PPE’s Pilot process (which may be occupied with its own computation or channel communication), a second MPI process, known as the Co-Pilot process, is created on each Cell node to provide services for these four channel types. Since Cell blades have two PPEs and each PPE has dual hardware threads, an added Co-Pilot process utilizes a computing resource that might otherwise go idle.

For type 2 transfers, the PPE process treats this as an MPI send/receive between itself and the Co-Pilot process. On the SPE side, mailbox messaging is used to send a read/write request to the Co-Pilot process. When Co-Pilot learns the address of the SPE process’s local memory buffer, it translates that into a main memory effective address (since SPE memory is mapped into main memory, from the PPE’s standpoint). Co-Pilot then uses that address in its own MPI call. The result is that the message transfers directly between the PPE’s buffer and the SPE’s local memory. Completion of the transfer is signalled to the SPE process by means of its mailbox. Note that this technique does not need recourse to DMA transfers.

Type 3 transfers are handled the same as type 2, with the PPE or non-Cell Pilot process contacting the remote SPE’s Co-Pilot process via MPI. For type 5 transfers, both SPE processes send their buffer addresses to their respective Co-Pilot processes, which then make the transfer between themselves via MPI.

Type 4 transfers do not involve MPI. Both SPE processes send their buffer addresses to the Co-Pilot process and wait for transfer completion confirmation. Whichever address arrives first is stored, then the Co-Pilot process polls for requests until the second SPE’s request arrives. Co-Pilot calculates the effective addresses corresponding to the two SPE buffers, transfers the data using memcpy, and then notifies each SPE process of completion via their respective mailboxes.

Initially, we attempted to use pthreads to provide the Co-Pilot service. However, it seems that MPI is liable to be configured without support for threading. Rather than insisting on the system administrator’s reconfiguring and reinstalling MPI, we decided to make CellPilot use processes so that it will work with MPI_THREAD_SINGLE support.
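To make the Type 4 path concrete, here is a hedged sketch of the kind of memory-mapped copy described above, written with standard libspe2 calls. The function name, its parameters, and the mailbox completion code are illustrative assumptions; it shows the mechanism, not CellPilot’s actual Co-Pilot code.

    #include <string.h>
    #include <libspe2.h>

    /* Copy nbytes from one local SPE's buffer to another's, then notify both.
       The buffer locations arrive as local-store offsets in the SPEs' requests. */
    static void copy_between_local_spes(spe_context_ptr_t src_spe, unsigned int src_ls_offset,
                                        spe_context_ptr_t dst_spe, unsigned int dst_ls_offset,
                                        size_t nbytes)
    {
        /* Each SPE's 256KB local store is mapped into the PPE's effective address
           space; spe_ls_area_get() returns the base of that mapping. */
        char *src_ea = (char *)spe_ls_area_get(src_spe) + src_ls_offset;
        char *dst_ea = (char *)spe_ls_area_get(dst_spe) + dst_ls_offset;

        /* Direct memory-mapped copy: no DMA setup and no MPI involved. */
        memcpy(dst_ea, src_ea, nbytes);

        /* Signal completion to both SPE processes via their inbound mailboxes
           (the value 1 as a "done" code is an assumption). */
        unsigned int done = 1;
        spe_in_mbox_write(src_spe, &done, 1, SPE_MBOX_ALL_BLOCKING);
        spe_in_mbox_write(dst_spe, &done, 1, SPE_MBOX_ALL_BLOCKING);
    }

Because the copy goes through the PPE-side mapping of both local stores, the SPE processes see none of the DMA tag management or alignment bookkeeping that a hand-coded transfer would require.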
C. Sample code

By way of example, the simple code in Figures 3 (PPE program) and 4 (SPE programs) runs on two Cell nodes. The PPE Pilot program is executed via mpirun. Each PPE process in turn starts one SPE process on its node (line 27 in main and line 10 in recvFunc), one of which writes an array of 100 integers to the other (line 43 writes; line 53 reads, with “*” illustrating the syntax for argument-supplied length). This is an example of using a Type 5 channel (Figure 2), which requires relaying through two PPEs. The channel is created on line 24 during the configuration phase, and the PI_CHANNEL variable is referenced in the SPE programs on lines 35 and 48.

 1  #include <stdio.h>
 2  #include <pilot.h>
 3
 4  PI_SPE_FUNC spe_send, spe_recv;
 5
 6  PI_CHANNEL *betweenSPEs;
 7  PI_PROCESS *recvSPE;
 8  //--- Receiver PPE function ---
 9  int recvFunc(int arg, void *ptr){
10      PI_RunSPE(recvSPE, 0, NULL);
11
12      return 0;
13  }
14
15  //--- Sender PPE function ---
16  int main(int argc, char *argv[]){
17      //configuration phase
18      int N = PI_Configure(&argc, &argv);
19
20      PI_PROCESS *recvPPE = PI_CreateProcess(recvFunc, 0, NULL);
21      PI_PROCESS *sendSPE = PI_CreateSPE(spe_send, PI_MAIN, 0);
22      recvSPE = PI_CreateSPE(spe_recv, recvPPE, 0);
23
24      betweenSPEs = PI_CreateChannel(sendSPE, recvSPE);
25      //execution phase
26      PI_StartAll();
27      PI_RunSPE(sendSPE, 0, NULL);
28
29      PI_StopMain(0);
30      return 0;
31  }

Figure 3. PPE program

32  //--- Sender SPE ---
33  //contents of spe_send.c
34
35  extern PI_CHANNEL *__ea betweenSPEs;
36
37  PI_SPE_PROCESS(int arg1, void *arg2)
38      int Array[100], i;
39
40      for(i=0; i<100; i++)
41          Array[i]=i;
42
43      PI_Write(betweenSPEs, "%100d", Array);
44  PI_SPE_END
45
46  //--- Receiver SPE ---
47  //contents of spe_recv.c
48  extern PI_CHANNEL *__ea betweenSPEs;
49
50  PI_SPE_PROCESS(int arg1, void *arg2)
51      int Array[100], i;
52
53      PI_Read(betweenSPEs, "%*d", 100, Array);
54
55      for(i=0; i<100; i++)
56          printf("%d ", Array[i]);
57      printf("\n");
58  PI_SPE_END

Figure 4. SPE programs

A longer example (too bulky to print here) involves three channel transfers: from one SPE process to its parent PPE process, from there to another node’s PPE process, and from there to its SPE process. That example took 80 lines to code using CellPilot. Recoding this example using the Cell SDK required 186 lines, and called functions such as mfc_put, mfc_write_tag_mask, mfc_read_tag_status, spu_write_out_mbox, spe_in_mbox_status, and so on. Recoding using DaCS required less code, at 114 lines, and called dacs_remote_mem_create, dacs_remote_mem_query, dacs_put, dacs_wait, dacs_remote_mem_release, and so on. The simplicity of the CellPilot approach is shown by its small code size.

V. CELLPILOT PERFORMANCE

The main metrics of interest to CellPilot users are the amount of the SPE’s 256K memory consumed by the library, and the latency of channel communication. The CellPilot object file, cellpilot.o, takes up 10336 bytes of SPE storage (per the Linux size command). In comparison, the DaCS SPE library, libdacs.a, is 36600 bytes. With the CellPilot approach, a good deal of the logic is carried out on the PPE side by the Co-Pilot process, where memory is abundant. Nonetheless, a pure SDK approach would be smaller, since CellPilot includes general purpose code for interpreting PI_Read and PI_Write format strings and carrying out mailbox messaging.

For timing purposes, the Intel MPI Benchmarks’ [21] PingPong test was used, described in the IMB Users Guide as “the classical pattern used for measuring startup and throughput of a single message sent between two processes.” The cluster consists of 8 nodes of dual 3.2 GHz PowerXCell 8i processors, and 4 nodes of 4- or 8-core 2.5 GHz Xeon processors, with gigabit Ethernet interconnect, running RedHat Enterprise Linux 5.2 and Open MPI 1.2.8. Each data type supported by CellPilot was sent across each of the 5 channel types to measure communication latency. For types that can involve either PPE or non-Cell processes (types 1 and 3), the times given are for PPE endpoints only, which were slower than for the Xeon nodes. The time measured is that for a message to be sent and received 1000 times. The reported time is the measured time divided by the number of repetitions and halved to yield the average time in microseconds required for a transfer in one direction. The test was performed for a single byte (“%b”) data type, and for an array of 100 long doubles (“%100Lf”) of 16 bytes each.
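As a sketch of the measurement arithmetic just described, the following hypothetical Pilot program ping-pongs the 100-long-double payload between PI_MAIN and an echo process, then divides the elapsed time by the repetition count and by two to obtain the one-way latency. The echo process, the channel names, and the use of MPI_Wtime() for timing are assumptions, not the authors’ benchmark harness.

    #include <stdio.h>
    #include <mpi.h>      /* MPI_Wtime(); usable here since Pilot runs on top of MPI */
    #include <pilot.h>

    #define REPS 1000

    PI_CHANNEL *to_echo, *from_echo;

    /* Echo process: bounce every message straight back to PI_MAIN. */
    int echo(int idx, void *p)
    {
        long double buf[100];
        for (int i = 0; i < REPS; i++) {
            PI_Read(to_echo, "%100Lf", buf);
            PI_Write(from_echo, "%100Lf", buf);
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        PI_Configure(&argc, &argv);
        PI_PROCESS *e = PI_CreateProcess(echo, 0, NULL);
        to_echo   = PI_CreateChannel(PI_MAIN, e);
        from_echo = PI_CreateChannel(e, PI_MAIN);
        PI_StartAll();

        long double buf[100] = {0};
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            PI_Write(to_echo, "%100Lf", buf);
            PI_Read(from_echo, "%100Lf", buf);
        }
        double elapsed = MPI_Wtime() - t0;

        /* Reported figure: total time / repetitions, halved, in microseconds. */
        printf("one-way latency: %.1f us\n", elapsed / REPS / 2.0 * 1e6);

        PI_StopMain(0);
        return 0;
    }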
Three kinds of tests were carried out: (1) via CellPilot, thus involving the Co-Pilot process; (2) via hand-coded SPE/PPE transfers using DMA; and (3) via hand-coded transfers using memory-mapped copying (i.e., CellPilot’s method, but without the generality of the Co-Pilot process). Timing results are given in Table II and graphed in Figure 5. For each bar, the lower solid portion represents the 1-byte message time, and the upper hashed portion the 1600-byte time. In addition, the throughput for the array case is graphed in Figure 6. Not surprisingly, Co-Pilot was found to add overhead compared to hand-coded transfers.

Figure 5. Latencies for CellPilot vs. hand-coded transfers
Figure 6. Throughput for CellPilot vs. hand-coded transfers

TABLE II. CELLPILOT VS. HAND-CODED TIMING (µs)

Channel Type   Bytes   CellPilot   DMA   Copy
1              1           105      98     98
1              1600        173     160    160
2              1            59      15     15
2              1600         76      15     30
3              1           140     114    107
3              1600        219     181    175
4              1           112      30     30
4              1600        123      30     60
5              1           189     131    117
5              1600        263     195    194

Our current analysis is that all SPE-connected channel types are paying some overhead for the Co-Pilot process. Moreover, type 2 uses MPI for the local PPE-to-Co-Pilot transfer, which could be a fast shared-memory copy, but nonetheless involves MPI processing in order to match the treatment of type 3 channels. Type 5 involves two Co-Pilot processes on separate Cell nodes. Given these insights, it is likely that Co-Pilot processing can be sped up in the future.

VI. STATUS AND FUTURE WORK

Although the CellPilot approach is new, and more testing is required, our initial results are encouraging. Currently, we are forging forward with various case studies for CellPilot, including the parallelization and implementation of scatter search, a well-known meta-heuristic that has been successfully applied to a variety of NP-hard problems, primarily in the areas of combinatorial optimization [22] and machine learning [23]. CellPilot will be made available for student projects in upcoming University of Guelph parallel programming classes. It will be available for free public downloading from the Pilot website [7].

As for future work, the main features of Pilot that currently do not apply to CellPilot are collective operations and deadlock checking. Pilot provides a subset of collective functions that are available to PPE and non-Cell processes, but CellPilot does not yet support collective operations among SPEs, much less involving a mixture of SPE and other processes. It may also be possible to optimize the operation of the Co-Pilot process and reduce its overhead to be more competitive with low-level, hand-coded methods.

VII. CONCLUSION

The two key innovations furnished by CellPilot are (1) the Co-Pilot PPE process, which effectively allows SPE processes to participate in MPI as first-class citizens; coupled with (2) the extension of Pilot’s “friendly face for MPI” to both PPE and SPE programs. The first innovation enables the delegation of all message handling, buffering, and synchronization to MPI, which already has the proven code to do those things. The second means that programmers are spared from dealing with the complexities of both MPI and the Cell SDK. This was accomplished by adding only two function calls to the Pilot API (PI_CreateSPE and PI_RunSPE).
CellPilot also has at least two strengths in its implementation: (1) the bulk of SPE messaging logic has been off-loaded onto the Co-Pilot PPE process, thereby conserving scarce SPE memory; and (2) the fact that SPE memory can be mapped into PPE effective addresses is used to set up direct transfers for MPI and intra-Cell messages, thus avoiding extra buffering and copying.

In summary, CellPilot provides the following services to a Cell cluster programmer: (1) easy starting (with parameters) and stopping of SPE programs; (2) all kinds of communication between PPE, SPE, and non-Cell processes using a single, uniform API, while hiding the complications of DMA transfers, signals, mailboxes, alignment issues, and network transfers. The programmer still has to cope with the limited memory available on the SPEs.

Using CellPilot, a scientific programmer is able to design and implement a cluster application based on the process/channel model, and utilize the processor resources of a Cell cluster without delving into low-level communication details. By having a single, unified programming model, one major difficulty—that of interprocess communication and synchronization—can be overcome.

ACKNOWLEDGMENT

This research was supported by NSERC (Natural Sciences and Engineering Research Council) of Canada. We gratefully acknowledge the use of the Cell cluster hosted by SHARCNET (Shared Hierarchical Academic Research Computing Network), and consultations with R. Enenkel (IBM Canada) and M. Perrone (IBM USA).

REFERENCES

[1] IBM developerWorks. Cell Broadband Engine resource center: Your definitive resource for all things Cell/B.E. [online, cited 03/13/2011]. Available from: http://www.ibm.com/developerworks/power/cell/.
[2] Message Passing Interface Forum. MPI: A message-passing interface standard version 2.2 [online]. September 2009 [cited 06/27/2011]. Available from: http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf.
[3] N. Girard, J. Carter, W.B. Gardner, and G. Grewal. CellPilot: Seamless communication within Cell BE and heterogeneous clusters. Journal of Physics: Conference Series, 256(1):012002, June 5-9 2010. Poster paper.
[4] J. Carter, W.B. Gardner, and G. Grewal. The Pilot approach to cluster programming in C. In Proc. of the 24th IEEE International Parallel & Distributed Processing Symposium, Workshops and PhD Forum, Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC10), pages 1–8, Atlanta, Apr. 23 2010.
[5] J. Carter, W.B. Gardner, and G. Grewal. The Pilot library for novice MPI programmers. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’10), pages 351–352, Bangalore, India, Jan. 9-14 2010. Poster paper.
[6] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666–677, 1978.
[7] Pilot home [online]. Available from: http://carmel.socs.uoguelph.ca/pilot. C and Fortran APIs copyright by University of Guelph.
[8] A. Kumar, G. Senthilkumar, M. Krishna, N. Jayam, P.K. Baruah, R. Sharma, A. Srinivasan, and S. Kapoor. A Buffered-Mode MPI Implementation for the Cell BE Processor, volume 4487/2007 of Lecture Notes in Computer Science, pages 603–610. Springer Berlin / Heidelberg, 2007.
[9] M. Krishna, A. Kumar, N. Jayam, G. Senthilkumar, P.K. Baruah, R. Sharma, S. Kapoor, and A. Srinivasan. A Synchronous Mode MPI Implementation on the Cell BE Architecture, volume 4742/2007 of Lecture Notes in Computer Science, pages 982–991. Springer Berlin, 2007.
[10] S. Pakin. Receiver-initiated message passing over RDMA networks. In Proc. of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), pages 1–12, Miami, Florida, April 2008. Available from: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4536262.
[11] M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani. MPI Microtask for programming the Cell Broadband Engine processor. IBM Syst. J., 45(1):85–102, 2006.
[12] D. Kunzman, G. Zheng, E. Bohm, and L.V. Kale. Charm++, Offload API, and the Cell processor. In Proc. of the Workshop on Programming Models for Ubiquitous Parallelism at the Fifteenth International Conference on Parallel Architectures and Compilation Techniques (PACT-2006), Seattle, Sep. 16-20 2006.
[13] B. Bouzas, R. Cooper, J. Greene, M. Pepe, and M.J. Prelle. MultiCore Framework: An API for programming heterogeneous multicore processors. Technical report, Mercury Computer Systems, Inc., 2006.
[14] P. Bellens, J.M. Perez, R.M. Badia, and J. Labarta. CellSs: A programming model for the Cell BE architecture. In Proc. of the 2006 ACM/IEEE Conference on Supercomputing (SC ’06), page 5, New York, NY, USA, 2006. ACM.
[15] J.M. Perez, P. Bellens, R.M. Badia, and J. Labarta. CellSs: Making it easier to program the Cell Broadband Engine processor. IBM J. Res. Dev., 51(5):593–604, 2007.
[16] D. Zhang, Q.J. Li, R. Rabbah, and S. Amarasinghe. A lightweight streaming layer for multicore execution. ACM SIGARCH Computer Architecture News, 36(2):18–27, 2008.
[17] C.K. Anand and W. Kahl. Synthesising and verifying multicore parallelism in categories of nested code graphs. In Michael Alexander and William Gardner, editors, Process Algebra for Parallel and Distributed Processing, Computational Science Series, chapter 1, pages 3–45. Chapman & Hall/CRC Press, 2009.
[18] M. Ruggiero, M. Lombardi, M. Milano, and L. Benini. Cellflow: A parallel application development environment with run-time support for the Cell BE processor. Euromicro Symposium on Digital Systems Design, 0:645–650, 2008.
[19] M. Alind, M.V. Eriksson, and C.W. Kessler. Blocklib: A skeleton library for Cell Broadband Engine. In Proc. of the 1st International Workshop on Multicore Software Engineering (IWMSE ’08), pages 7–14, New York, NY, USA, 2008. ACM.
[20] C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In Henk Sips, Dick Epema, and Hai-Xiang Lin, editors, Euro-Par 2009 Parallel Processing, volume 5704 of Lecture Notes in Computer Science, pages 863–874. Springer Berlin / Heidelberg, 2009.
[21] Intel MPI Benchmarks IMB [online, cited 06/28/2011]. Available from: http://www.intel.com/cd/software/products/asmo-na/eng/219848.htm.
[22] F. Gortazar, A. Duarte, M. Laguna, and R. Marti. Black box scatter search for general classes of binary optimization problems. Computers and Operations Research, 37(11):1977–1986, Nov. 2010.
[23] S.C. Chen, S.W. Lin, and S.Y. Chou. Enhancing the classification accuracy by scatter-search-based ensemble approach. Applied Soft Computing, 11(1):1021–1028, Jan. 2011.