## A Case of System-level HW/SW Co-design and Co-verification of a Commodity Multi-Processor System with Custom Hardware Sungpack Hong\*, <u>Tayo Oguntebi</u>, Jared Casper, Nathan Bronson\*, Christos Kozyrakis, Kunle Olukotun Pervasive Parallelism Lab Stanford University \* This work was done while the authors were at Stanford. # Trends in hardware design - Moore's Law still in effect - But no more free lunch...performance is no longer free - Parallelism - Due to limited clock frequency - Many CPUs rather than a single super-fast one. - Heterogeneity - Due to limited power budget - Specialized HW for a specific task #### Questions Raised... - This trend leads us to a system with ... - Multiple CPUs - Custom HW units - All working concurrently for a single program - More than simple time-sharing - Questions - What does such a system look like? - Do we have proper design/verification methodology for such a system? - If not, what are the issues? - This is a case-study presentation to explore these questions rather than to answer them. # Our system as a case study - External hardware acceleration of software transactional memory on commodity CPUs - ... wait, what? - Several x86 CPUs - Two sockets, each AMD quad core - All running a single multi-threaded application - Plus a custom HW - A FPGA, attached coherently to CPUs - Accelerating a special software library, called Software Transactional Memory (STM) - All working in parallel - Okay then, what is STM? # Backgrounds: Transactional Memory - Parallel programming is hard - data races... - Related issues: dead-lock, live-lock, ... - Transactional Memory (TM) - A proposal to simplify parallel programming - The programmer simply declares critical regions in the program as transactions and puts all shared reads/writes inside - The runtime system (a.k.a. transactional memory) detects all the runtime data races - The runtime roll-backs conflicting transactions and thus guarantees serialize-ability of the program execution. # TM Example: Programming ``` void money transfer(int account[], int from, int to, int amount) { Begin BEGIN TX(); transaction int from before = READ(account[from]); Read shared int to before = READ(account[to]); variables int from after = from before - amount; Local int to after = to before + amount; computation Write shared WRITE (account[from], from after); variable WRITE (account[to], to after); computation END TX(); End transaction Each transaction is guaranteed ``` to be "atomic" # TM Example: Runtime #### Back to our case - STM: all in SW → lots of overhead, slow - HTM: all in HW → Requires CPU core modification - ... but you'd rather avoid changing a commodity core's RTL #### Our approach - Part in HW, rest in SW (a hybrid approach) - External custom HW - Sits outside cores (on FPGA) via memory bus - No core modification required - External communication takes some time but we know how to mitigate this! #### Our idea in a nutshell Each core sends a message to FPGA, for each read/write. + Some ideas for Core 2 latency hiding on Core 1 **FPGA** the SW-side $C \rightarrow B$ $A \rightarrow C$ 1:Read A 1:Read C 2:Read C **FPGA** detects violation and 1:Write A sends "Wow, this is notification so cool. Let's 1:Write C build this thing already!" 1:Ok To Commit? + Some ideas for fast violation 1:OK detection on the HW-side 2:Rollback # Our initial co-design - How to design a closely-coupled system? - New HW - New software library (STM) that uses the new HW - Let's simulate it! - Design HW/SW interface (i.e. communication protocol) with a cycle-based x86 ISS (Instruction-Set simulator) - Custom HW → pure virtual model - Develop the SW on top of simulation - What about HW design? - Do RTL design separately - Find RTL bugs with unit test - High-level protocol is already validated with simulation # And there we go ... - Our simulation was successful - Our protocol works and is faster than conventional STM - We obtained a HW framework - Two sockets, each with AMD quad-core - An FPGA connected via HyperTransport® - We implemented a coherent cache on FPGA\* - It was a (re-usable) part of our design - We implemented the custom HW as we designed - All the unit tests are passed - So we ran the whole system .... - ... And it didn't work - Transactions were not atomic at all What happened..? (In retrospect) Read happens here Core 2 Core 1 **FPGA** Core 2 Core 1 **FPGA** 2:Read C 2:Read C 1:Write C 1:Write C FPGA can't detect the 1:Ok To 1:Ok To conflict (thinks it's a Commit? Commit? valid read-after-write) 1:OK 1:OK 2:Rollback The message delivery has been delayed (out of order) What we designed What actually happened ## How could we have missed that? #### Problems with our simulator - We used a detailed x86 ISS - All instructions included and some cache protocols. - But the simulated interconnect was far from that of real HW system... - HyperTransport + external pin-out + FPGA ... - The simulation was in-order and deterministic - no latency variance #### Problems with unit testing - Cannot generate the complicated error sequence! - Requires a lot of interaction with software ## A Futile Resistance - "Hey, we already have the FPGA implementation. Let's just debug it (with a logic analyzer)." - Problem - The 'time span' of a typical error is very long - It is not clear when and how the problem happens - i.e. Error not detectable by a simple trigger ## What do we need? - Verification of a concurrent system - Interleaving of parallel executions - Out-of-order message delivery - Many different interleaving in a short time (i.e. fast execution) - Resemblance to the actual system - Actual HW (RTL) + Actual SW debugging preferred - Minimum modification for verification - Crucial features for verification - Deterministic replay the exact same interleaving should be generated at will - A better mechanism for bug finding than waveform view - Easier log analysis, at least # Comparisons of Available Tools | Method | Pros | Cons | |------------------------------------------------------------------------------------|---------------------------------------------------------------|----------------------------------------------------------------------------------------------------| | Prototyping | <ul><li>Target HW + SW</li><li>Fast execution</li></ul> | <ul><li>Limited visibility</li><li>No deterministic replay</li></ul> | | Full RTL sim.<br>(CPU + interconnection +<br>Custom HW) | <ul><li>Target HW + SW</li><li>Deterministic replay</li></ul> | <ul><li>All RTL not available</li><li>Too slow</li><li>No variation of interleaving</li></ul> | | Binary instrumentation (i.e. PIN-based simulation) | <ul><li>Target SW</li><li>Fast execution</li></ul> | <ul><li>No HW debugging</li><li>No deterministic replay</li></ul> | | Instruction-set sim.<br>+ RTL sim (or virtual HW) | <ul><li>Target SW</li><li>Deterministic replay</li></ul> | No variation of interleaving | | SW Model<br>+ network sim.<br>(Bus Functional Model)<br>+ RTL sim. (or virtual HW) | Faster than ISS | <ul><li>SW modification</li><li>Variation of interleaving?</li><li>Deterministic replay?</li></ul> | - Connect ISS with network sim (BFM) + RTL sim - Add various interleavings? #### [Option 2] Modify Target SW - Connect SW with BFM + RTL sim - Add various interleavings - Add deterministic replay #### • Easier to do (you know a lot more about SW than simulator) Faster to run # ISS-based approach (illustration) ``` void READ(...) { inline void foo(...) { HW check status(); int HW check status(...) BEGIN TX(); int ... = READ(...); return some processing(); *FPGA ADDR & bitmask; local compute(); WRITE (...); HW send msg(); ... } END TX(); HAL (HW Abstraction Layer) TM library User Program Do we Compile need CPU Where / How do we add various interleavings, simulation i.e. which simulator do we want to modify? at all? Binary A lot of simulation overhead ISS Network RTI Simulator Simulator Simulator (CPU) (HyperTransport) (Custom HW) (cycle-based) (cycle-based) (event-based) ``` - Deterministic Concurrency Control - BFM itself is single-threaded - BFM uses light-weight threads (i.e. fibers) to implement user threads in the applications - Contexts switch happens at network packet injection - Fast execution - All the local computations are natively executed - No CPU simulation at all - We only need software interacting with HW simulation - Do not waste simulation cycles for computation - Variable interleaving of concurrent executions - + Deterministic replay - All the local computation happens at a cycle - Actual packet delivery time is deterministic - Insert random idle cycles before packet injection - Not meant to compensate for computation time - But inserts deterministic variation in concurrent executions - Interleaving is dependent solely on random seed - Deterministic re-play → use the same random seed ``` int BFM_SIM_Read(...) { BFM_Idle_Cycles(get_random()); Context_Switch(SIM); ... Network_Inject_packet(READ_REQ, ...); Context_Switch(SIM); } ``` - Convenient error analysis - Logging at high-level - at packet level, or - at HAL level | W 1000786h | | |-------------------|---------------------------------| | W 1000786h | | | | | | | | | | | | – TX commit – | | | C 1000786h | | | - TX end - | | | - | | | Commit Okay. | 1 | | olated by 100786h | | | | - TX end -<br>-<br>Commit Okay. | - Automatic error detection - Simulation = shared-memory, single threaded, deterministic execution - Each user thread can see what other threads are doing - Further modify STM - → Maintain a shadow data-structure that checks conflicts on-line (only works for simulation) #### Worked well for our case Fast simulation enabled many different interleaving of concurrent executions in short time - With this environment, we actually designed and debugged - The Custom HW (RTL) - The new SW (STM library) - And the new communication protocol (system) - All together ## Generalization and Pitfalls #### Key insights - SW modification is easier than simulator modification - Local computation can be natively executed - Only global communication is simulated via Network simulation - Caveat: Ease of SW modification - Assumes that you can identify HW interface easily - Assumes that you can distinguish local computation and global communication (i.e. shared data access) - Usually true - Parallel SW designed with HAL and critical sections - But you should check your SW... - Not suitable for performance estimation # Requests for CAD Researchers - Our approach was still ad hoc ... - Is there a more systematic solution? - Part-wise selection of details of simulation - (e.g) Native SW execution (for local computation) - + Detailed HW simulation (for custom HW design) - + Detailed network simulation - Randomizing variance of concurrent executions - Should be deterministically re-playable # Summary - Co-design and Co-verification for post Moore's law era - Parallelism and Heterogeneity - Potential concurrency issues at design time - Required Features - Variable interleaving of concurrent executions - Deterministic Replay - Fast execution time + sufficient of visibility - In our case study - We used SW Model + BFM (interconnection) + RTL sim - SW modification was easier than ISS improvement - Hope there can be a generalized solution # Questions?