Graham R. Hellestrand
VaST Systems Technology Corporation, 1250 Oakmead Pkwy, Ste 310, Sunnyvale, CA 94070 USA

Abstract
The engineering of embedded systems optimized for particular purposes involves the mastery of multiple engineering disciplines and technologies, including: software engineering, digital and analog electronics engineering, networked systems engineering, operating system and compiler technology, processor micro-architecture technology, multi-processor platform technology, simulation technology and, (i) for wireless products, radio frequency technology and engineering, and (ii) for automotive control products, mechanical engineering and complex distributed control systems. The methodologies and tools for building optimal systems have begun to become available over the past 3-4 years and, where deployed, are having a revolutionary impact on companies and industries. As the effects of economic globalization become more pervasive, the differentiation of successful companies will focus more on the excellence of engineering and its ability to rapidly build purpose-driven families of software-hardware electronic controllers optimized for families of products within and across market sectors. This paper deals with the technologies, methodologies and tools required to enable engineering teams to develop a competitive, often first-mover, advantage in the highly competitive global market. The use of engineering as a competitive advantage, read weapon, will distinguish the winners from the losers through this protracted market inflection point.
The ability to support data-driven decision making early in the systems development process is one of the underlying drivers of building models of systems that are timing accurate and high performance. Optimizing systems across the dimensions of speed, response latency, cost (size) and power consumption is rarely done and, at a pre-silicon level, it is an undertaking only possible using the high-performance, timing-accurate models called Virtual System Prototypes (VSPs) in this paper. It is known that poor software and inefficient algorithms have a 1st-order effect on an embedded system's performance, as does hardware architecture (bus skeleton, memory hierarchy, etc.) at the system level. This is difficult to reconcile with practice, where next-generation product planning often has a prime focus on processor microarchitecture, regardless of the fact that iterative microarchitecture improvement typically yields a 2nd- or 3rd-order effect. The question is: what has happened to software in optimizing systems? Embedded systems may be very complex, as is the dual-processor network switching subsystem in Figure 1, below. Even though the hardware may be complex in such modern systems, the software is yet more complex and dominates the engineering process and budget. It is negligent not to employ empirical methods as a normal part of the development of software in real-time, embedded systems.
Figure 1: Dual StarCore SC1200 network switching subsystem. Each SC1200 exposes a DMA master and a core master interface, connected over 32-bit buses and bridges to memory banks (MemBanks) and a 2 MB DRAM. Application software (Viterbi), on interrupt, shuffles data from DRAM to the MemBanks.
Functional Requirements. This is an abstract, often mathematical, model of the system that is executable and empirically testable. Such models are constituted from inter-communicating tasks/processes created using Simulink, UML, or some other programming system. The first version of the Initial Executable Architecture is likely to have sufficient characteristics to be able to represent the underlying system with (i) a high level of functional fidelity and rudimentary timing or (ii) a high level of timing fidelity and rudimentary function. It is imperative that executable models created in the Architecture Driven process are finely instrumented so that measurement of timing, behaviour and underlying event activity is readily available from all levels of the model. This data is used to drive the iterative refinement of the abstract system, resulting in high levels of both function and timing fidelity. What the abstract models lack, intentionally, is underpinning physical realization detail, more of which becomes defined during the process of mapping the Initial System Architecture to an Executable System Specification or Virtual System Prototype.
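As an illustration of what such fine instrumentation can look like, the following minimal Python sketch (hypothetical names and structure, not the VaST tooling) models inter-communicating tasks that post timestamped events to a shared trace, from which timing and activity measures are then derived.

import collections
import heapq

# Minimal sketch of an instrumented executable-architecture model.  Every task
# posts (time, source, event) records to a shared trace so that timing,
# behaviour and event activity can be queried at any level of the model.

Trace = collections.namedtuple("Trace", "time source event")

class InstrumentedModel:
    def __init__(self):
        self.now = 0          # simulated time (arbitrary units)
        self.queue = []       # pending (time, seq, task, payload) events
        self.trace = []       # flat, finely grained measurement stream
        self._seq = 0

    def record(self, source, event):
        self.trace.append(Trace(self.now, source, event))

    def post(self, delay, task, payload):
        self._seq += 1
        heapq.heappush(self.queue, (self.now + delay, self._seq, task, payload))

    def run(self, until):
        while self.queue and self.queue[0][0] <= until:
            self.now, _, task, payload = heapq.heappop(self.queue)
            task(self, payload)

# Two intercommunicating tasks: a producer feeding a consumer through the model.
def producer(m, n):
    m.record("producer", "send")
    m.post(delay=5, task=consumer, payload=n)
    if n > 1:
        m.post(delay=10, task=producer, payload=n - 1)

def consumer(m, n):
    m.record("consumer", "receive")

m = InstrumentedModel()
m.post(0, producer, 3)
m.run(until=100)

# Derived measurement: per-message latency between matched send/receive events.
sends = [t.time for t in m.trace if t.event == "send"]
recvs = [t.time for t in m.trace if t.event == "receive"]
print("send->receive latencies:", [r - s for s, r in zip(sends, recvs)])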
The NP-hard problem of mapping an Initial System Architecture to underlying physically-mappable structures, realized as software and models of electronic hardware, mechanical, radio-frequency and other devices (the VSP), is a current area of intense research activity [3]. All of the credible mapping strategies involve empirical experimentation using actual or stub software and models of the electronic hardware and other devices. Driving the structural, functional and timing parameters of the software and hardware subsystems using Design of Experiments [4] techniques leads to systems optimized for specified purposes. It is important for the VSP to be optimized prior to commitment to physical realization as software + silicon or FPGA + external devices. Attempts at post-facto system optimization during realization are expensive and dangerous and lead to significant degradation of the engineering process. The surviving VSP is now the golden reference model to be used concurrently for software development and for iterative refinement of the electronic hardware (called a Virtual Prototype or Platform) to an implementation in silicon or some other physical technology. The inherent concurrency in the engineering process reduces the effect of software being almost always on the critical path in modern electronic systems engineering. A comparison of the Conventional Hardware Software Design Process and the architecturally driven Quantitative Systems Engineering Process, with quantitative empirical experimentation driving optimization, is shown in Figure 3, below. The immediate impact is the movement of peak resource deployment in a project from the last third of a project to the first third. This is due, mainly, to the up-front development of an optimized system architecture that enables a running start in software development, including porting of legacy code and operating systems, and an early start on reducing the Virtual Prototype hardware to physical reality. The data used in deriving the Conventional Process curves is from the development of a 2.5G cell phone. This data is reused to normalize data from a customer building a 2.5G cell phone, but using the Quantitative Systems Engineering Process. The comparison is not perfect, but the strong difference is verified in practice.
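The sketch below illustrates, under stated assumptions, what a Design of Experiments style sweep over VSP parameters might look like. The factors, levels, objective weights and the stand-in measurement function are hypothetical; in a real flow each treatment would be measured by running the instrumented workload on the Virtual System Prototype.

import itertools

# Full-factorial Design-of-Experiments sweep over hypothetical VSP parameters.

factors = {
    "cache_bytes": [0, 128, 512, 1024],
    "line_bytes":  [16, 32],
    "memory":      ["SDR", "DDR"],
    "bus_width":   [32, 64],
}

def measure(cfg):
    """Stand-in for a VSP simulation run returning (speed, power, cost)."""
    speed = 1.0 + 0.003 * cfg["cache_bytes"] + 0.2 * (cfg["bus_width"] == 64)
    power = 1.0 + 0.002 * cfg["cache_bytes"] + 0.3 * (cfg["memory"] == "DDR")
    cost  = 1.0 + 0.004 * cfg["cache_bytes"] + 0.1 * (cfg["bus_width"] == 64)
    return speed, power, cost

def objective(speed, power, cost):
    # Example weighted objective: favour speed, penalize power and cost.
    return speed - 0.5 * power - 0.3 * cost

# Full-factorial design: every combination of factor levels is one experiment.
best = None
for values in itertools.product(*factors.values()):
    cfg = dict(zip(factors.keys(), values))
    score = objective(*measure(cfg))
    if best is None or score > best[0]:
        best = (score, cfg)

print("best configuration under this objective:", best[1])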
Figure 3: Resource deployment over project time for the Conventional (no prototype) Engineering Process and the Quantitative Systems Engineering Process, plotted by category (Architecture, Hardware - ASIC Development, Software/Firmware, Systems Integration + V&V, Overall Project) with polynomial trend curves.
A comparison of the summary statistics for each process is revealing:

  Measure                                     Risk   Resources   Project Duration
  Conventional Process                          40         799          18 months
  Quantitative Systems Engineering Process       9         476          12 months
  % Improvement                                 78          40                 33
Risk is computed as the square root of (the variance of the Overall Project resource curve divided by the square of the fraction of the project remaining at the point of peak resource deployment). The use of variance, as a conservative measure, parallels the use of the concept in the Black-Scholes option pricing model, in which high option prices reflect the risk associated with their purchase. The mitigation of risk is the biggest factor, and this, together with the 33% reduction in time-to-market, makes an overwhelming case for the adoption of the Quantitative Systems Engineering Process.
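Written out, one reading of this prose definition is (the symbols are introduced here for clarity and do not appear in the original):

$$\mathrm{Risk} \;=\; \sqrt{\frac{\operatorname{Var}\!\left(R_{\mathrm{overall}}\right)}{p_{\mathrm{rem}}^{2}}} \;=\; \frac{\sigma_{R_{\mathrm{overall}}}}{p_{\mathrm{rem}}}$$

where $R_{\mathrm{overall}}$ is the Overall Project resource curve and $p_{\mathrm{rem}}$ is the fraction of the project remaining at the point of peak resource deployment. Under this reading, a highly variable resource curve whose peak arrives late in the project (small $p_{\mathrm{rem}}$) drives the risk measure up sharply.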
3. Strategic Engineering: Purpose Driven Optimization

Strategic Engineering is defined as: the ability to optimize products through structuring families of controller architectures using quantitative processes (typically empirical) and the utilization of innovation to minimize time-to-market, resources required, and project risk.
The optimization of systems at the Virtual Prototype level requires the formulation of a specific objective function (such as: maximize speed, maximize throughput, minimize power). The scientific method is about rejecting hypotheses using a rational, data-driven decision making process. One of the challenges in making decisions in this engineering domain is the complexity of modern super systems, and identifying patterns in, and making sense of, the potentially billions of pieces of data collected from hundreds of unique sources of measurement of platform activity and latency available from the silicon and simulation. On the optimization side, there are many ways to construct objective functions. The classical way is to track event frequencies and/or latencies and to construct ad hoc functions based on functionally related events, such as CPU events, bus and bus-bridge events, memory events, device events, etc. A more systematic way is to use multivariate statistics to help formulate dependence relations based on abstract concepts and more concrete factors derived from the interpretation of highly correlated events measured during simulation or silicon activity. The latter approach is beyond the scope of this paper.
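As a rough illustration of the classical approach, the sketch below groups event counters harvested from a simulation run by the subsystem they belong to and combines them into an ad hoc objective. The counter names, groupings and weights are invented for illustration only.

# Classical ad hoc objective: group raw event counters by subsystem, then
# combine the per-subsystem activity with hand-chosen weights.

event_counts = {                      # counters harvested from a VSP run
    "cpu.instr_retired": 4_200_000,
    "cpu.pipeline_stall": 310_000,
    "bus.transfer": 880_000,
    "bridge.cross": 95_000,
    "mem.dram_access": 410_000,
    "dev.dma_burst": 12_000,
}

groups = {
    "cpu":    ["cpu.instr_retired", "cpu.pipeline_stall"],
    "bus":    ["bus.transfer", "bridge.cross"],
    "memory": ["mem.dram_access"],
    "device": ["dev.dma_burst"],
}

weights = {"cpu": 1.0, "bus": 2.5, "memory": 4.0, "device": 1.5}

def group_activity(name):
    return sum(event_counts[e] for e in groups[name])

# Weighted sum of per-subsystem activity, e.g. a proxy for energy, to be
# minimized across candidate architectures.
objective = sum(weights[g] * group_activity(g) for g in groups)
print("objective value:", objective)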
4. Measuring Systems: The Enabling Metre

The fundamental premise of optimization is that there is an ability to accurately measure the system being optimized. In a Virtual System Prototype, the requirement is to be able to probe processor models, bus and bus bridge models, device models, software algorithms and structure. These are the elements that enable decisions to be based on data.
Measurements are typically of functionality (such as cache activity) and time (such as intervals between two significant events). Sequences of measures are also important, to provide history-based optimization computations. An example of a history-based optimization function appears below in Equation 1. In an event-driven simulation environment, a general form of an optimization function (in this case power) can be expressed as a function whose parameters are functions, each characterizing the contribution to the objective function of one of the components constituting the system, viz. CPUs, buses, bus bridges, memories and peripheral devices. This equation is based on functions (f_x) that take occurrences of event sequences that correspond to the degree of activity in a particular element of a design. The weights (W_x) applied to the functions give relative values to each function described for a Virtual Prototype. The weights themselves may be functions.
Equation 1:

$$f_{\mathrm{Power}} = W_{\mathrm{Pipe}} f_{\mathrm{Pipe}} + W_{\mathrm{Instr}} f_{\mathrm{Instr}} + W_{\mathrm{Cache}} f_{\mathrm{Cache}} + W_{\mathrm{TLB}} f_{\mathrm{TLB}} + W_{\mathrm{RegAcc}} f_{\mathrm{RegAcc}} + W_{\mathrm{MemAcc}} f_{\mathrm{MemAcc}} + W_{\mathrm{PeriphAcc}} f_{\mathrm{PeriphAcc}}$$

where

$$f_{\mathrm{Instr}} = 2 f_{\mathrm{Instr,jmp}} + 2 f_{\mathrm{Instr,except}} + 0 f_{\mathrm{Instr,ctrl}} + 12 f_{\mathrm{Instr,coproc}} + 0 f_{\mathrm{Instr,LdSt}} + f_{\mathrm{Instr,arith}} + f_{\mathrm{Instr,other}}$$

and

$$f_{\mathrm{Instr},i} = \sum \left(\text{instructions of type } i \text{ in } k \text{ cycles}\right)$$
These are the types of functions that drive the quantitative engines of empirical design.
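For concreteness, the following minimal sketch evaluates a function of this form over instrumentation data from a run; only the instruction-mix weights are taken from Equation 1, while the component weights and the counts themselves are illustrative assumptions.

# Equation 1 as an executable objective: per-component contribution functions
# f_x weighted by W_x, with the instruction term built from counts of
# instruction types accumulated over k-cycle windows.

# Instruction-mix weights from Equation 1
# (jmp, except, ctrl, coproc, LdSt, arith, other).
INSTR_WEIGHTS = {"jmp": 2, "except": 2, "ctrl": 0, "coproc": 12,
                 "LdSt": 0, "arith": 1, "other": 1}

def f_instr(windowed_counts):
    """windowed_counts: one dict per k-cycle window, {instruction type: count}."""
    return sum(INSTR_WEIGHTS[t] * c
               for window in windowed_counts
               for t, c in window.items())

def f_power(activity, windowed_counts, weights):
    """activity: event totals for Pipe, Cache, TLB, RegAcc, MemAcc, PeriphAcc."""
    power = weights["Instr"] * f_instr(windowed_counts)
    for name, events in activity.items():
        power += weights[name] * events
    return power

# Illustrative inputs: two k-cycle windows of instruction-type counts plus
# aggregate activity counts for the remaining components.
windows = [{"jmp": 900, "except": 4, "ctrl": 1200, "coproc": 50,
            "LdSt": 3100, "arith": 4400, "other": 346},
           {"jmp": 850, "except": 2, "ctrl": 1150, "coproc": 65,
            "LdSt": 3000, "arith": 4600, "other": 333}]
activity = {"Pipe": 20_000, "Cache": 7_200, "TLB": 300,
            "RegAcc": 15_500, "MemAcc": 4_100, "PeriphAcc": 220}
W = {"Pipe": 0.4, "Instr": 1.0, "Cache": 1.8, "TLB": 2.0,
     "RegAcc": 0.2, "MemAcc": 3.5, "PeriphAcc": 5.0}

print("f_Power =", f_power(activity, windows, W))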
Graph 3B: Power consumption (average power × 10^7 per instruction) versus cache size (bytes) for the Sieve of Eratosthenes running on the ARM926E subsystem of the Figure 1 VSP, for cache line sizes of 16 B and 32 B with SDR and DDR memory.
essentially achieved full speed. The power graph shows another picture. Uncached, the power consumed by the VSP was 20%-35% less than the power consumed with 64-byte caches (the variability was due to cache line size, wayness and memory type) and 200% higher than the power consumed with 128-byte caches. What we are observing here is the step-function effect on power consumption of installing a cache in a processor. For the sieve program, beyond 512 bytes the power consumed was stable and about 20% higher than at the minimum-power cache configuration of 128 bytes.
Installing a small cache in a processor to achieve a 4-fold increase in performance has a detrimental effect on power consumption due to the infrastructure required to support the cache. The cost of a cache is also high, since the infrastructure consumes relatively large silicon real estate. These considerations led to an investigation of alternative memory hierarchies that might achieve a better trade-off between speed, power and cost for a controller running a limited amount of code in an embedded application. We varied the cache hit/miss power weightings of the processor to mimic the relative power consumed by a dedicated external buffer of 128 bytes (essentially a small, physically addressed, direct-mapped, on-chip cache external to the processor). This architecture is similar to the buffer organization found in processors like the Renesas SH-2A [5], a processor popular in automotive control, where differences of cents in the price of a controller translate to several million dollars in large manufacturing runs. The results were that we could achieve a further ~40% power saving whilst maintaining near optimum speed, with a chip cost close to the non-cache cost. This is not an intuitive result, and it required empirical investigation and experimentation to determine an optimal outcome.
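The following back-of-the-envelope sketch illustrates the shape of this trade-off; the per-access energies, hit rates and memory-access fraction are hypothetical, not the measured VSP data.

# Energy per instruction added by three memory hierarchies, using hypothetical
# per-access energies and hit rates (the paper derives these weightings
# empirically by re-weighting cache hit/miss power in the processor model).

def energy_per_instr(hit_rate, hit_energy, miss_energy, mem_fraction=0.35):
    """Average energy added per instruction by the memory hierarchy.

    mem_fraction: fraction of instructions that access memory.
    """
    per_access = hit_rate * hit_energy + (1.0 - hit_rate) * miss_energy
    return mem_fraction * per_access

hierarchies = {
    # name:                        (hit rate, hit energy pJ, miss energy pJ)
    "uncached (DRAM only)":          (0.00,  0.0, 18.0),
    "128 B on-chip cache":           (0.92,  3.0, 22.0),  # cache infra adds hit cost
    "128 B external direct buffer":  (0.90,  1.2, 20.0),  # leaner infrastructure
}

for name, (hr, he, me) in hierarchies.items():
    print(f"{name:32s} {energy_per_instr(hr, he, me):5.2f} pJ/instr")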
Figure 4: Transaction throughput (Trans/DSec) and simulation headroom (MIPS) versus transaction size (1024, 64 and 4 bytes) and number of active virtual processor models (0, 1 and 2 VPMs).
Figure 4 shows transactions of various sizes (1024, 64 and 4 bytes) being transmitted at a high rate over a complex switch to which two StarCore SC1200 digital signal processors are attached. Initially no processors are activated, and each is then successively activated. The results bar chart of Figure 4 is best read as a sequence of three (Transaction, Headroom (MIPS)) pairs into the slide. As transactions become progressively smaller, there is relatively more work to be performed by the model to transmit and receive them. The Headroom measure is the amount of available host cycles for further simulation. As more modelled processors are activated and the transaction size is reduced, the available host processing headroom diminishes. If more host CPU cycles were available, perhaps from a dual- or quad-processor symmetric multiprocessing compute engine, there would be sufficient cycles to manage high transaction data rates and high levels of target software execution in the simulated multicore networked switch subsystem.
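One plausible way to compute such a headroom figure is sketched below; the formulation and the numbers are assumptions for illustration, not the measurement used in Figure 4.

# Headroom as the host MIPS still available for further simulation work,
# estimated from host capacity and the host instructions consumed per second
# of wall-clock time while the model runs.

def headroom_mips(host_capacity_mips, host_instr_consumed, wallclock_seconds):
    used_mips = host_instr_consumed / (wallclock_seconds * 1e6)
    return max(0.0, host_capacity_mips - used_mips)

# Example: a hypothetical 2400 MIPS host simulating the switch with two active
# processor models and 4-byte transactions consumes most of its cycles.
print(headroom_mips(host_capacity_mips=2400,
                    host_instr_consumed=2.25e9,
                    wallclock_seconds=1.0))   # -> 150.0 MIPS of headroom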
6. Summary
Empirical experimentation is a powerful mechanism with which to refute hypotheses that, when carefully constructed, drive the quantitative engineering process. To engage in this engineering process prior to the existence of a physical realization requires the existence and use of a model. If hypothesis building concerns speed, power consumption, reaction time, latency, meeting real-time schedules, etc., the model needs to be timing accurate (processor, buses, bus bridges and devices). If the extensive execution of software is an intrinsic part of the empirical experimentation, then the model needs to have high performance across all components. This paper assumes the existence of pre-silicon, high performance (20-100 MIPS), timing accurate virtual system prototypes. The ability to instrument simulatable system models also helps identify compute engines that are optimized for the task of simulating particular types of complex models of systems - an interesting meta-use of the instrumentation of models. Optimizing systems with complex objective functions is not intuitive. Complex trade-offs between hardware structure and the software and algorithms that are executed on the hardware cannot be resolved by conjecture or formal analysis alone; the acquisition of data as part of well-formed experiments refuting thoughtfully constructed hypotheses enables decision making driven by results. Optimization comes from considering the whole system - hardware and software together, not separately.
7. References
[1] Winters, F.J., Mielenze, C. and Hellestrand, G.R. Design Process Changes Enabling Rapid Development. Proc. Convergence 2004, P-387, Oct 2004, 613-624, Society of Automotive Engineers, Warrendale, PA.
[2] Hellestrand, G.R. The Engineering of Supersystems. IEEE Computer, 38, 1 (Jan 2005), 103-105.
[3] Hellestrand, G.R. Systems Architecture: The Empirical Way - Abstract Architectures to Optimal Systems. ACM Conf. Proc. EmSoft 2005, Sept 2005, Jersey City, NJ.
[4] Montgomery, D.C. Design and Analysis of Experiments. 5th Ed. John Wiley & Sons, NY, 2001.
[5] Renesas SH-2A, SH2A-FPU Software Manual, Rev 2.00, REJ09B0051-0200O, 13 Sept. 2004, Renesas Technology, Tokyo, Japan.