RESEARCH PAPERS OF MR NABIL LITAYEM


Published during his PhD, entitled METHODOLOGICAL CONTRIBUTIONS TO THE DESIGN AND OPTIMIZATION OF EMBEDDED SYSTEMS

International Journal

1. N Litayem, M Kolhar, I Mhadhbi, S M. Abd El-atty, S Ben Saoud. Hashing Based Authentication for Ultra-Low Cost Low Power SCADA Application Using MSP430 Microcontroller, IJETAE Volume 3, Issue 12, December 2013
2. N Litayem, B Jaafar, S Ben Saoud. Embedded Microprocessor Performance Evaluation Case Study Of The Leon3 Processor, JESTEC Vol. 7, No. 5 (2012)
3. N Litayem, AB Achballah, SB Saoud. Building XenoBuntu Linux Distribution for Teaching and Prototyping Real-Time Operating Systems, IJACSA Vol. 2, No. 2, February
4. N Litayem, S Ben Saoud. Impact of the Linux Real-time Enhancements on the System Performances for Multi-core Intel Architectures, IJCA 2011, Number 2 - Article
5. N Litayem, S Ben Saoud. Rapid Hardware-In-the-Loop Implementation for FPGA Based Embedded Controller for More Reliable Electrical Traction Systems, JCSCS 2010 Volume 3, Issue 2.

International conference

1. I Mhadhbi, N Litayem, S Ben Othmen, and S Ben Saoud. "DSC Performance Evaluation and Exploration, Case of TMS320F335." International Conference on Control, Engineering & Information Technology (CEIT 13), Proceedings Engineering & Technology - Vol. 1
2. N Litayem, M Ghrissi, AK Ben Salem, S Ben Saoud. Designing and building embedded environment for robotic control application, IECON 2009, Porto, Portugal, November 3-5
3. AK Ben Salem, S Ben Othman, S Ben Saoud, N Litayem. Servo Drive System Based on Programmable SoC Architecture, IEEE IECON 2009, Porto, Portugal, November 3-5
4. N Litayem, S Ben Saoud. Impact of real-time enhancements on the computation performances for multi-core Intel architectures, JTEA 2010, Hammamet, Tunisia, April
5. N Litayem, M Ghrissi, S Ben Saoud. Etude de l'Influence de la Communication Processeur Mémoire sur la Performance d'un SoC à base de circuit FPGA XILINX, JTEA 08, Hammamet, Tunisia, April
6. M Ghrissi, S Ben Saoud, N Litayem. Correction des erreurs systématiques de l'odomètre et suivi de trajectoires sur un robot mobile industriel type tricycle, JTEA 08, Hammamet, Tunisia, April
7. N Litayem, M Ghrissi, S Ben Saoud. Etude Comparative des moyens de communications inter processeurs dans les architectures MPSoC, GEI 08, Sousse, Tunisia, March
8. N Litayem, M Ghrissi, S Ben Saoud. Embedded Microprocessor Systems Hardware Performance Evaluation and Benchmarking, ICESCA 08, Gammarth, Tunisia, May 2008.

International Journal of Emerging Technology and Advanced Engineering
Website: (ISSN , ISO 9001:2008 Certified Journal, Volume 3, Issue 12, December 2013)

Hashing Based Authentication for Ultra-Low Cost Low Power SCADA Application Using MSP430 Microcontroller

Nabil Litayem 1, Manjur Kolhar 2, Imene Mhadhbi 3, Saied M. Abd El-atty 4, Slim Ben Saoud 5
1,2,4 Computer Science and Information, Salman Bin Abdulaziz University, Wadi College of Arts and Science, Kingdom of Saudi Arabia
3,5 LSA Laboratory, INSAT-EPT, University of Carthage, Tunisia

Abstract
SCADA (Supervisory Control and Data Acquisition) systems have become a widely used technology. This is directly related to the ubiquity of smart systems that use a wide range of technologies in control and supervision applications. MCU technology is very promising in this field, especially with the emergence of safety-critical MCUs, cost and power reduction, and wireless connectivity. Because such technologies are used everywhere, their security must be taken into account. In this paper, we propose an RTU (Remote Terminal Unit) authentication solution based on a lightweight hashing algorithm. The proposed solution is suitable for SCADA systems using the MSP430 ultra-low-power, low-cost MCU. This work is a proof of concept showing that such technology, combined with freely available tools, can add reliable authentication functionality to our previously designed SCADA systems.

Keywords
SCADA, MSP430, RTU, Security, hash-based authentication.

I. INTRODUCTION
Earlier SCADA systems were based on an event-driven operating system and basic serial communications. This kind of solution was exposed to few security threats because the SCADA devices were physically isolated from any external intrusion. Thanks to Moore's Law, SCADA (Supervisory Control and Data Acquisition) [1] applications have become cost effective and ubiquitous. Such solutions are based on standard hardware, open source software and open protocols.
SCADA applications are nowadays used in power distribution monitoring, nuclear simulators, military data acquisition, health care and thousands of other applications [2], [3]. Furthermore, they are considered part of the Internet of Things ecosystem. This ubiquity requires careful security considerations to ensure the confidentiality, integrity and availability of such systems. Any compromise in SCADA system security can have serious consequences [4]. During the last decade, many research works have studied the security of such systems and proposed innovative solutions [5], [6], [7]. In this study, we introduce an authentication solution for a SCADA RTU that uses a hashing algorithm running on the MSP430 microcontroller. The proposed solution builds its authentication on various profiles of the Quark [8] hashing algorithm, which were chosen after the qualitative and quantitative surveys presented in this paper. The remainder of this work is organized as follows: Section 2 presents the MSP430 development platform, followed by a survey of SCADA applications and the solutions available for them. In Section 4, we present the choice and execution of the hashing algorithm. Finally, Section 5 concludes this contribution.

II. HARDWARE PLATFORM

A. Introduction to MSP430 Microcontroller
Known for its low power consumption, the MSP430 from Texas Instruments is a family of 16-bit microcontrollers commonly used in wireless sensor/actuator networks and metering applications [8]. The use of these MCUs has become very broad thanks to the introduction of innovative features beyond low cost and low power. The main features of the MSP430 microcontroller are summarized in Table 1.

TABLE 1. MAIN CHARACTERISTICS OF THE MSP430 MCU
Feature: Description
Instruction sets: 27 RISC instructions
Registers: 12 general purpose registers
Memory: 16-bit word or byte addressing
Addressing modes: Register direct, register indexed, register indirect and register indirect with autoincrement
Peripherals: USART, SPI, I²C, 10/12/14/16-bit ADCs, internal oscillator, timer, PWM, watchdog, brownout reset circuitry, comparators, on-chip op-amps, 12-bit DAC, LCD driver, hardware multiplier, USB, and DMA
Frequency: 1 MHz - 25 MHz
Electric Power: <1 µA in IDLE mode
On-Chip Memory: 256 KB Flash, 16 KB RAM

B. MSP430 family and development tools
Texas Instruments has a wide range of MSP430 flavours designed for diverse applications, such as smart metering, wireless communication, motor control and personal health care. For each application area of the MSP430, Texas Instruments offers a development or evaluation board. The most successful development boards are the MSP-EXP430F5529, ez430 Chronos [9] and MSP430 Launchpad [10]. The MSP430 has the advantage of a complete software ecosystem, ranging from powerful development environments such as IAR, Code Composer Studio and Energia to well-suited software stacks such as SimpliciTI [11] or the Capacitive Touch Sense library. In addition, TI MCU solutions are very cost effective and scalable. The wide variety of available TI MCUs makes it easy to move from one TI MCU to another.

C. Launchpad Board
Since 2010, Texas Instruments has expanded the MSP430 portfolio by introducing the MSP430 Value Line shown in Figure 1. This new low-cost family, starting at $0.25, is essentially intended to replace older 8-bit MCUs.

Fig 1. Functional Block Diagram, MSP430G2x53
Fig 2. Launchpad Board

To promote this new family, TI has introduced the MSP-EXP430G2 LaunchPad shown in Figure 2.
This evaluation board is a low-cost and very valuable evaluation platform, priced at $4.30. The Launchpad can be used to develop applications for all Value Line MSP430 microcontrollers.

III. SCADA SYSTEMS

A. Introduction to SCADA Systems
Nowadays, SCADA (Supervisory Control and Data Acquisition) systems are found in a wide range of electronic devices and applications such as steel making, electric grids, healthcare devices and chemistry. SCADA has also become vital for driving critical experiments such as nuclear fusion. SCADA is not tied to a precise technology; it is a type of application. Any application that gets data about a system in order to control that system is a SCADA application. Furthermore, SCADA systems are computer-based systems that introduce various advanced and innovative supervisory functionalities. This makes it possible to automate traditional complex industrial processes where human control is impractical. Critical infrastructures and industries nowadays make extensive use of this kind of technology.

Fig 3. SCADA System Architecture [12] (clients, dedicated server, data server, networks, controllers)

As shown in Figure 3, a typical SCADA application controls systems, collects field and sensor data, processes and displays the collected data, and sends commands to the controlled systems. In industrial control systems, geographic extent is the main criterion distinguishing SCADA from DCS (Distributed Control Systems): DCSs are used within a single processing or generating plant or over a small geographic area [12], whereas SCADA systems are used for large, geographically dispersed distribution operations. In a nuclear power plant, for example, a DCS can be used for power production and SCADA for power distribution. With the emergence of the Smart Grid and Internet of Things concepts, SCADA systems are receiving more attention. Our work is based on SCADA systems, but it may be extended to DCS.

B. SCADA System Architecture
A SCADA system has three main components [13]:
- Remote Terminal Unit (RTU): the intelligent part connected to the controlled process. The RTU is responsible for reading inputs, making smart decisions, providing output signals, taking new orders and providing real-time feedback to the HMI.
- Human Machine Interface (HMI): the interface between the user and the SCADA system. The HMI must provide intelligible data about the physical controlled process.
- Communication infrastructure: used to connect the various components of the SCADA system. It is responsible for handling the various communication protocols and for providing bridging capabilities between the RTU network and the corporate network.

C. Security of SCADA systems
Earlier SCADA systems were not designed with public access in mind. The only possible security threat was physical destruction. Because modern SCADA systems are interconnected with public networks, several security issues must be seriously considered. Many researchers have demonstrated the existence of threats to SCADA systems using simulations and real systems [14]. Because of the security flaws present in SCADA systems, many academics and organizations are working to make SCADA safe from these threats [15]. Sandia National Laboratories (USA), the National Infrastructure Security Co-ordination Centre (UK) and the British Columbia Institute of Technology (Canada) are the most influential organizations working in this field. Table 2 lists common attacks on SCADA systems.
TABLE 2. COMMON ATTACKS FOR SCADA SYSTEMS
Attack: Impact on the system
Denial of service: Delaying or blocking the stream of data through control networks
Unauthorized changes: Modification of program instructions in RTUs at remote sites, resulting in damage to equipment, premature shutdown of processes, or even disabling of control equipment
Sending wrong information: Used against control system operators to disguise unauthorized changes or to initiate inappropriate actions by system operators
Control system software modification: Producing unpredictable results
Interference: Interfering with the operation of safety systems
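Several of the attacks in Table 2 rely on a node issuing commands without proving its identity. As an illustration of how a hash function can provide such proof, the following minimal sketch shows a challenge-response exchange between a supervisor and an RTU that share a secret. It is an example under stated assumptions rather than the protocol used in this work: SHA-256 from the Python standard library stands in for Quark, and the function names, the 16-byte nonce size and the secret value are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_challenge(num_bytes: int = 16) -> bytes:
    """Verifier side: draw a fresh random nonce for each authentication attempt."""
    return secrets.token_bytes(num_bytes)


def respond(shared_secret: bytes, nonce: bytes) -> bytes:
    """Prover side: hash the shared secret together with the nonce.

    Assumption: SHA-256 is used here only as a stand-in for the Quark hash.
    """
    return hashlib.sha256(shared_secret + nonce).digest()


def verify(shared_secret: bytes, nonce: bytes, response: bytes) -> bool:
    """Verifier side: recompute the expected digest and compare in constant time."""
    expected = hashlib.sha256(shared_secret + nonce).digest()
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    secret = b"hypothetical-shared-secret"  # illustrative value only
    nonce = make_challenge()
    print("genuine node accepted:", verify(secret, nonce, respond(secret, nonce)))
    print("intruder accepted:", verify(secret, nonce, respond(b"guess", nonce)))
```

Because a new nonce is drawn for every attempt, a digest captured on the link cannot simply be replayed later, and the secret itself never travels over the network.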

D. Studied SCADA RTU
Our system can supervise and control various greenhouse signals such as temperature, humidity and pressure. It can also be controlled remotely to initiate or receive critical alarms. The studied SCADA RTU is a PID thermal process controller with supervision capability over the USCI interface. The intelligent part of this RTU is based on the MSP430G2553 MCU.

Fig 6. Designed SCADA RTU

The system illustrated in Figure 6 is designed to emulate greenhouse temperature regulation with some local supervision features such as an LCD and LED interface. The supervision interface, in turn, is written in Visual Basic. Through this interface we can tune the PID regulator, fix the temperature setpoint and supervise the evolution of the temperature. Figure 4 and Figure 5 show some views of this interface.

Fig 4. Supervision interface
Fig 5. Configuration Interface

IV. CHOICE AND ADAPTATION OF LIGHTWEIGHT HASHING ALGORITHM
The goal of this section is to review the available hashing algorithms in order to adopt an appropriate one as the authentication solution for our SCADA system. The hashing algorithm is chosen according to security level, computational complexity and memory footprint. The last two criteria are essential, since our MSP430G2553 microcontroller has 16 KB of Flash memory and 512 bytes of RAM, and cannot run faster than 16 MHz.

A. Introduction to Hashing Algorithms
Hashing algorithms [16] are commonly used in computing. Their main purpose is to map a message of variable length to a message of fixed length. This consists of applying a hashing function H to a message x to produce H(x), called the message hash. In addition, finding a message y different from x such that H(y) = H(x) must be computationally infeasible.
This behavior can be used in the following fields:

- Authentication algorithms
- Password storage mechanisms
- Digital Signature Standard (DSS)
- Transport Layer Security (TLS)
- Internet Protocol Security (IPSec)
- Random number generation algorithms

In our application, the hashing algorithm is used to protect the authenticity of transmitted information and to offer a reliable authentication mechanism.

B. Review of lightweight hashing algorithms
Hashing algorithms are used in a broad range of applications, and many are available today. Each algorithm may be better suited to a specific field, such as powerful video platforms, FPGA platforms or low-performance platforms. Table 3 reviews these algorithms to give an overview of them. Based on this review, we select the algorithm best suited to our application.

TABLE 3. CANDIDATE HASHING FUNCTIONS
Algorithm: Presentation
SHA family: Secure Hash Algorithms are a family of hash algorithms published by NIST. SHA has many derivative standards such as SHA-0, SHA-1 and SHA-3
MD4, MD5, MD6: Message-Digest Algorithm is a family of broadly used cryptographic hash functions developed by Ronald Rivest, producing a 128-bit digest for MD4 and MD5 and a 256-bit digest for MD6
Quark: Family of cryptographic functions designed for resource-constrained hardware environments
CubeHash: A very simple cryptographic hash function designed at the University of Illinois at Chicago, Department of Computer Science
Grøstl: Hashing algorithm designed by a team of cryptographers from the Technical University of Denmark (DTU) and TU Graz
Lane: Cryptographic hash function submitted to the NIST SHA-3 competition by the COSIC research group
Shabal: Cryptographic hash function submitted to NIST's hash function competition by the French-funded research project Saphir
Spectral Hash: Cryptographic hash function family based on the discrete Fourier transform, submitted to the NIST hash function competition
Keccak-f: Cryptographic hash function submitted to the NIST SHA-3 hash function competition
Whirlpool: Cryptographic hash function recommended by the NESSIE project and adopted by the ISO and IEC as part of the ISO/IEC standard
UHASH: Keyed hash function specified in RFC 4418. Its primary application is the UMAC message authentication code
SPONGENT: Lightweight hash function family, known for its small footprint in hardware implementations
Photon: A lightweight hash function designed for very constrained devices
dm-present: Ultra-lightweight block cipher designed for RFID applications
SQUASH: Not collision resistant, suitable for RFID applications

C. Quark Hashing Algorithm
As noted in [21], designers of lightweight cryptographic algorithms or protocols face a trade-off between two opposite design philosophies. The first consists in creating new schemes from scratch, whereas the second consists in reusing available schemes and adapting them to the system constraints. The main features of Quark are the separation of digest length from security level and the use of shift registers.

D. Adoption of Quark hashing algorithm to the MSP430G2553 microcontroller
In our SCADA system, the hashing algorithm is executed only when a new supervision node connects. The algorithm can therefore have a moderate complexity, since the system carries no notable load during authentication.

V. PERFORMANCE EVALUATION OF VARIOUS QUARK PROFILES RUNNING UNDER MSP430G2553

E. Obtained results
After adapting the various Quark profiles to the MSP430G2553, we evaluated their performance in terms of execution time, detailed in Table 4, and algorithm footprint, detailed in Table 5.
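The footprints reported in Table 5 can be put in perspective by relating them to the 16 KB of Flash available on the MSP430G2553. The short sketch below does this arithmetic. It assumes that the reported footprint is code size stored in Flash and that 16 KB means 16 × 1024 bytes; neither point is stated explicitly in the text, and the function name is introduced here only for illustration.

```python
# Assumption: "16 KB of Flash" is taken as 16 * 1024 bytes.
FLASH_BYTES = 16 * 1024

# Footprints in bytes, copied from Table 5.
FOOTPRINT_BYTES = {
    "UQUARK": 4057,
    "DQUARK": 4188,
    "SQUARK": 4230,
    "CQUARK": 4376,
}


def flash_usage_percent(footprint_bytes: int, flash_bytes: int = FLASH_BYTES) -> float:
    """Return the share of Flash occupied by a footprint, in percent."""
    return 100.0 * footprint_bytes / flash_bytes


if __name__ == "__main__":
    for name, size in FOOTPRINT_BYTES.items():
        print(f"{name}: {size} bytes, {flash_usage_percent(size):.1f}% of Flash")
```

Under these assumptions every profile occupies roughly a quarter of the Flash, which leaves room for the control and communication code of the RTU.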

TABLE 4. EXECUTION TIME OF VARIOUS PROFILES OF UQUARK ALGORITHM
Algorithm: Execution time (ms)
UQUARK:
DQUARK:
SQUARK:
CQUARK:

TABLE 5. FOOTPRINT OF VARIOUS PROFILES OF UQUARK ALGORITHM
Algorithm: Footprint (Byte)
UQUARK: 4057
DQUARK: 4188
SQUARK: 4230
CQUARK: 4376

F. Results analysis
The results reflect the good performance of the Quark algorithm. The lightest version can be very appropriate for wireless sensor network applications. The overhead of the complete profile is nevertheless acceptable, and we believe this profile is better suited to modern SCADA applications. We would like to emphasize that these results were obtained at an MCU frequency of 1 MHz. They can easily be improved by increasing the frequency, since the adopted MCU can run at up to 16 MHz, or by switching to a higher-end MCU family.

VI. CONCLUSION AND PERSPECTIVES
This study focused on SCADA system security, but since the boundaries between SCADA systems, DCS, WSN, WSAN and IoT are becoming blurrier, this work can be extended to encompass those areas. In addition, Texas Instruments has introduced a wide range of Launchpad boards for various MCU platforms, each offering different features. Investigating these platforms would be an interesting extension of this work. We currently plan to expand this work by adding an appropriate lightweight cryptography algorithm [22] to our platform. Such a solution can be very useful in distributed SCADA applications.

Acknowledgment
I would like to thank Salman Bin Abdulaziz University for its continued and positive support for scientific research, and the anonymous reviewers for their valuable comments.

REFERENCES
[1] A. Daneels and W. Salter, CERN, Geneva, Switzerland. "What is SCADA?" International Conference on Accelerator and Large Experimental Physics Control Systems, 1999, Trieste, Italy.
[2] Randy S. Tolentino and Sungwon Park. "Efficient SCADA Module for Improving Medical Information Monitoring and Reliable Medical Service in Hospital Networks." Journal of Security Engineering, 2010.
[3] U. S. Patil. "Study of Wireless Sensor Network in SCADA System for Power Plant." International Journal of Smart Sensors and Ad Hoc Networks (IJSSAN), ISSN No (Print), Volume-1, Issue-2, 2011.
[4] Chee-Wooi Ten, Chen-Ching Liu, and Govindarasu Manimaran. "Vulnerability Assessment of Cybersecurity for SCADA Systems." IEEE Transactions on Power Systems, Vol. 23, No. 4, November 2008.
[5] Wang, Yongge. "sSCADA: Securing SCADA infrastructure communications." arXiv preprint (2012).
[6] Annarita Giani, Gabor Karsai, Tanya Roosta, Aakash Shah, Bruno Sinopoli, Jon Wiley. "A Testbed for Secure and Robust SCADA Systems."
[7] Cristina Alcaraz, Javier Lopez, Jianying Zhou and Rodrigo Roman. "Secure SCADA framework for the protection of energy control systems." Concurrency Computat.: Pract. Exper. 2011; 23.
[8] Aumasson, Jean-Philippe, et al. "Quark: A lightweight hash." Cryptographic Hardware and Embedded Systems, CHES. Springer Berlin Heidelberg.
[9] Yoo, Seong-eun. "A Wireless Sensor Network-Based Portable Vehicle Detector Evaluation System." Sensors 13, no. 1 (2013).
[10] Chernbumroong, Saisakul, Anthony S. Atkins, and Hongnian Yu. "Activity classification using a single wrist-worn accelerometer." Software, Knowledge Information, Industrial Management and Applications (SKIMA), International Conference on. IEEE.
[11] Nikitin, Pavel V., Shashi Ramamurthy, and Rene Martinez. "Simple Low Cost UHF RFID Reader."
[12] Friedman, Larry. "SimpliciTI: simple modular RF network specification." Update (2007).
[13] Daneels, Axel, and Wayne Salter. "What is SCADA." International Conference on Accelerator and Large Experimental Physics Control Systems.
[14] Daneels, A., & Salter, W. (1999, October). What is SCADA. In International Conference on Accelerator and Large Experimental Physics Control Systems.
[15] Davis, C. M., Tate, J. E., Okhravi, H., Grier, C., Overbye, T. J., & Nicol, D. (2006, September). SCADA cyber security testbed development. In Power Symposium, NAPS, North American. IEEE.
[16] Igure, V. M., Laughter, S. A., & Williams, R. D. (2006). Security issues in SCADA networks. Computers & Security, 25(7).

[17] Jianwei, L., & Huijie, C. (2013). A Dynamic Hashing Algorithm Suitable for Embedded System. TELKOMNIKA Indonesian Journal of Electrical Engineering, 11(6).
[18] Balasch, J., Ege, B., Eisenbarth, T., Gérard, B., Gong, Z., Güneysu, T., & Von Maurich, I. (2013). Compact implementation and performance evaluation of hash functions in ATtiny devices. In Smart Card Research and Advanced Applications. Springer Berlin Heidelberg.
[19] Aumasson, J. P., Henzen, L., Meier, W., & Naya-Plasencia, M. (2013). Quark: A lightweight hash. Journal of Cryptology, 26(2).
[20] Guo, J., Peyrin, T., & Poschmann, A. (2011). The PHOTON family of lightweight hash functions. In Advances in Cryptology, CRYPTO 2011. Springer Berlin Heidelberg.
[21] Bogdanov, A., Knežević, M., Leander, G., Toz, D., Varıcı, K., & Verbauwhede, I. (2011). SPONGENT: A lightweight hash function. In Cryptographic Hardware and Embedded Systems, CHES 2011. Springer Berlin Heidelberg.
[22] Eisenbarth, T., & Kumar, S. (2007). A survey of lightweight-cryptography implementations. Design & Test of Computers, IEEE, 24(6).

Journal of Engineering Science and Technology, Vol. 7, No. 5 (October 2012)
School of Engineering, Taylor's University

EMBEDDED MICROPROCESSOR PERFORMANCE EVALUATION CASE STUDY OF THE LEON3 PROCESSOR

NABIL LITAYEM 1,2,*, BOCHRA JAAFAR 2, SLIM BEN SAOUD 1
1 LECAP, EPT-INSAT, Centre urbain nord, BP 676 Tunis cedex, Tunisia
2 Computer Science and Information Department, King Salman Bin Abdulaziz University, Wadi Addwaser, Saudi Arabia
*Corresponding Author: [email protected]

Abstract
In this paper we propose a performance evaluation methodology based on three complementary benchmarks. This work is motivated by the fact that embedded systems are based on very specific hardware platforms. Measuring the performance of such systems is therefore a very important task in any embedded system design process. In the classic case, hardware performance is a basic result reported by the hardware manufacturer. Customizing the hardware configuration, however, is one of the fundamental tasks of designers of FPGA-based embedded systems, who must then measure the hardware performance themselves. This paper focuses on hardware performance analysis of FPGA-based embedded systems using freely available benchmarks that reflect various performance aspects. Our study uses two embedded systems (mono-processor and bi-processor) based on the LEON3MMU processor and the eCos RTOS.

Keywords: Embedded Systems, Performance, Benchmark.

Nomenclatures
CPI: Cycles per instruction
MFLOPS: Million Floating Point Operations Per Second
MIPS: Million Instructions Per Second
MOPS: Millions of Operations Per Second

Abbreviations
AMBA: Advanced Microcontroller Bus Architecture
ASIC: Application-Specific Integrated Circuit
ASIP: Application-Specific Instruction-set Processor
EEMBC: EDN Embedded Microprocessor Benchmark Consortium
FPGA: Field-Programmable Gate Array
GNU: Gnu's Not Unix
GPL: General Public License
GPP: General Purpose Processor
HAL: Hardware Abstraction Layer
HDL: Hardware Description Language
MMU: Memory Management Unit
PDA: Personal Digital Assistant
POSIX: Portable Operating System Interface for Unix
RISC: Reduced Instruction Set Computer
RTEMS: Real-Time Executive for Multiprocessor Systems
RTOS: Real-Time Operating System
SPEC: Standard Performance Evaluation Corporation
VAX: Virtual Address eXtension

1. Introduction
Human activity is becoming more and more reliant on embedded systems, which are now present in many products such as PDAs, cameras and telephones. Designing this kind of system can take many approaches depending on the platform used. In a classic approach, a General Purpose Processor (GPP), an Application-Specific Instruction-set Processor (ASIP) or an Application-Specific Integrated Circuit (ASIC) can be used as the heart of the embedded system. Each of these solutions has its advantages and weaknesses. Currently, we are witnessing the emergence of embedded systems based on Field Programmable Gate Arrays (FPGA). This kind of solution allows rapid embedded systems generation [1], easy customization of the hardware configuration using pre-designed hardware IPs [2], future evolution of the embedded system and cost reduction.

To design an FPGA-based embedded system we have to choose an embedded processor and an embedded operating system. The embedded processor can be Hard-Core (built at the silicon level) or Soft-Core (delivered as a netlist or as HDL source). A Hard-Core embedded processor has the advantage of computing performance but limits the system in terms of portability and scalability. A Soft-Core embedded processor offers lower computing performance when the final platform is an FPGA, but is far better in terms of configurability, portability, customization and scalability [3, 4]. The embedded operating system can be time-sharing or real-time, proprietary or open source. The adoption of an embedded operating system depends on the available memory, the developers' strategy, real-time requirements, certification for the field of operation, and so on.

In addition, the limited hardware resources available to embedded software require the developer to have a clear idea of the hardware computing performance. Currently, many studies focus on hardware performance evaluation or estimation of customized architectures [5]. In this paper we present an approach to measuring the hardware performance of FPGA-based embedded systems using freely available benchmark solutions. This work is divided into six parts. The first part is a survey of performance evaluation. The

second part presents an overview of the platform used. In the third part we present the adopted benchmarks. This is followed by a presentation of the experimental conditions and the preparation of the environment. In the fifth part the experimental results are presented. The last section summarizes our position and future work.

2. Overview of Performance Evaluation Tools and Techniques
Evaluating the performance of a computer system [6] will always be a true challenge for its designers because such systems evolve constantly, especially in the embedded field where architectures tend to become more and more complex [7]. This activity belongs to the field of performance engineering, which proposes tools and methods to quantify the non-functional requirements of computer systems. The performance analysis of an embedded system may cover various aspects depending on the application for which the system is designed. Several design decisions are a direct result of performance evaluation. The first evaluation method was based on the number of operations per second. This approach quickly became obsolete, and traditional benchmarks were developed and adopted to measure specific performance aspects, each of which can be one of the various criteria of a computer system. Today there are many solutions for measuring hardware performance. Most of them are based on standard algorithms that are executed and used to report a number reflecting the speed of the hardware in a particular field. In general, the MIPS unit, in its literal meaning of millions of instructions per second, is used to measure processor hardware performance. This unit lost its significance when RISC computer architectures appeared, since the work performed by one CISC instruction may require several RISC instructions.
This led to the redefinition of MIPS as VAX MIPS, which is the performance factor of a given machine relative to a VAX 11/780. Other redefinitions were proposed later, such as Peak MIPS and EDN MIPS. Because of the fuzziness of the performance unit definition, several benchmarks were proposed that can be executed on various architectures to report their execution speed. The most recognized solutions are Dhrystone, which reports the performance of the architecture in Dhrystone MIPS; Stanford, which runs different algorithms and reports the performance of the architecture in every computation field covered by the benchmark; and Paranoia, which reports the characteristics of the floating point unit. Nowadays, several benchmarks combine the ones presented above with other widely known benchmarks such as Whetstone and Linpack. Each of these benchmarks tries to reflect as many hardware performance aspects as possible. MiBench [8] is the most popular implementation of this combination, since it can measure the performance of a studied architecture in several application fields. There are also commercial benchmarking solutions that are more efficient and more specialized, such as SPEC (Standard Performance Evaluation Corporation), which covers different computing fields, or EEMBC (Embedded Microprocessor Benchmark Consortium), designed especially for embedded systems.
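The difference between the literal MIPS figure and Dhrystone MIPS can be made concrete with a short sketch. It relies on the commonly cited reference score of 1757 Dhrystones per second for the VAX 11/780. The clock frequency, CPI and run counts in the usage lines are hypothetical values chosen for illustration, not measurements from this study, and the function names are introduced here only for the example.

```python
# Commonly cited Dhrystone score of the VAX 11/780, the 1-MIPS reference machine.
VAX_11_780_DHRYSTONES_PER_SECOND = 1757.0


def native_mips(clock_hz: float, cpi: float) -> float:
    """Literal MIPS: millions of instructions executed per second.

    clock_hz is the processor clock frequency and cpi the average
    number of cycles per instruction.
    """
    return clock_hz / (cpi * 1e6)


def dhrystone_mips(runs: int, elapsed_seconds: float) -> float:
    """Dhrystone MIPS: Dhrystone loops per second relative to the VAX 11/780."""
    dhrystones_per_second = runs / elapsed_seconds
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


if __name__ == "__main__":
    # Hypothetical processor: 50 MHz clock, 1.25 cycles per instruction.
    print(f"native MIPS: {native_mips(50e6, 1.25):.1f}")
    # Hypothetical run: 175700 Dhrystone loops completed in 2 seconds.
    print(f"Dhrystone MIPS: {dhrystone_mips(175700, 2.0):.1f}")
```

The two figures generally differ for the same processor, because the first counts machine instructions while the second counts completed units of benchmark work.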

In this paper we present a hardware performance analysis of mono-processor and bi-processor embedded systems using three benchmarks, each covering one side of the computing system. Our platform is based on two open source components (the LEON3 processor and the eCos RTOS), which makes us independent of any FPGA or RTOS vendor.

3. Presentation of the Used Platform

3.1. Overview of LEON3 microprocessor
LEON3 [9], presented in Fig. 1, is a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 [10] architecture. The model is highly configurable and particularly suitable for system-on-a-chip (SoC) designs. The full source code is available under the GNU GPL (General Public License), allowing free and unlimited use for research and education. LEON3 is also available under a low-cost commercial license, allowing it to be used in any commercial application for a fraction of the cost of comparable IP cores. In addition, Gaisler Research offers a fault-tolerant version of the LEON3 at a very competitive cost. The LEON3 processor is distributed as part of the GRLIB IP library, allowing simple integration into complex SoC designs. GRLIB also includes a configurable LEON multi-processor design, with up to 16 CPUs attached to the AHB bus, and a large range of on-chip peripheral blocks. The GRLIB library contains template designs and bitfiles for various FPGA boards from Actel, Altera, Lattice and Xilinx.

Fig. 1. Overview of LEON3 Architecture [9] (7-stage integer pipeline, 3-port register file, instruction and data caches and AMBA AHB interface, with optional blocks such as MUL32, MAC16, DIV32, IEEE 754 floating-point unit, co-processor, MMU, interrupt control, debug interface and trace buffer)

3.2. Overview of eCos RTOS
eCos is an abbreviation of Embedded Configurable Operating System [11].
It is an open source, royalty-free, real-time operating system intended for deeply embedded applications. The highly configurable nature of eCos allows the operating system to be customized to precise application requirements, delivering the best possible run-time performance and an optimized hardware resource footprint. A thriving net community has grown up around the operating system, ensuring on-going technical innovation and wide platform support. Figure 2 shows the layered architecture of eCos.

Fig. 2. Overview of eCos Architecture. [Figure: layered diagram showing the application, libraries (math, C), compatibility layers (POSIX, µITRON), web server, networking stack, file system, kernel, RedBoot ROM monitor, hardware abstraction layer (interrupts, virtual vectors, exceptions), device drivers (Ethernet, serial, flash) and target hardware.]

The main components of the eCos architecture are the HAL (Hardware Abstraction Layer) and the eCos kernel. The purpose of the eCos HAL is to make the application independent of the hardware target: the application manipulates the hardware layer through the HAL API. The HAL is also used by the upper OS layers, which makes porting eCos to a new hardware target a simple task consisting of developing the HAL for the new target. The eCos kernel is the core of the eCos system. It includes most of the components of a modern operating system: scheduling, synchronization, interrupt and exception handling, counters, clocks, alarms, timers, etc. It is written in C++, allowing applications written in this language to interface directly with the kernel resources. The eCos kernel also supports interfacing to the standard µITRON and POSIX compatibility layers.

3.3. Combination eCos LEON3

The choice of these components allows us to be independent of any FPGA manufacturer or RTOS vendor, since eCos is available for a wide range of embedded processors and LEON has been ported to Xilinx, Altera, Actel and Lattice FPGAs. In addition, eCos allows a smooth migration to embedded Linux. OpenRISC [12] could also be used as the processor, and RTEMS [13] or embedded Linux [14] could be adopted as the OS. The performance measurements are presented using the three benchmarks described in the next section, and the hardware platform is simulated using tsim-leon3 for the mono-processor architecture and grsim-leon3 for the bi-processor platform in SMP (symmetric multi-processing) configuration.
These two simulation tools represent the LEON3 architecture very closely and offer many other important features for system prototyping.

4. Used Benchmarks

In our work we have chosen three complementary benchmarks, each of which mirrors a different performance aspect of the studied embedded system. The results obtained with each benchmark can be used to evaluate a specific performance aspect of the hardware platform.

4.1. Dhrystone

Dhrystone [15] is a synthetic benchmark developed in 1984 by Reinhold P. Weicker in Ada, Pascal and C. It is intended to be representative of integer system performance. Dhrystone consists of 12 procedures included in one measuring loop with 94 statements. Dhrystone grew to become representative of general processor (CPU) performance until it was superseded by the CPU89 benchmark suite from the Standard Performance Evaluation Corporation [16], today known as the "SPECint" suite.

4.2. Stanford

The Stanford Benchmark Suite [17] is a small benchmark suite that was assembled by John Hennessy and Peter Nye around the time of the MIPS R3000 processors. The suite contains ten applications: eight integer benchmarks and two floating-point benchmarks. The original suite measured the execution time in milliseconds of each benchmark in the suite. The Stanford Small Benchmark Suite includes the following programs:
Perm: A tightly recursive permutation program.
Towers: The canonical Towers of Hanoi problem.
Queens: The eight Queens chess problem solved 50 times.
Integer MM: Two 2-D integer matrices multiplied together.
FP MM: Two 2-D floating-point matrices multiplied together.
Puzzle: A compute-bound program.
Quicksort: An array sorted using the quicksort algorithm.
Bubblesort: An array sorted using the bubblesort algorithm.
Treesort: An array sorted using the treesort algorithm.
FFT: A floating-point Fast Fourier Transform program.
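To make the measurement principle of the Stanford suite concrete, the following minimal sketch times one kernel of the Towers kind and reports the elapsed time in milliseconds. It is written in Python for brevity and is not the original benchmark code; the function names and the problem size of 14 disks are hypothetical choices made for illustration.

```python
import time


def towers_of_hanoi(disks, source="A", target="C", spare="B"):
    """Return the number of moves made while solving Towers of Hanoi."""
    if disks == 0:
        return 0
    moves = towers_of_hanoi(disks - 1, source, spare, target)
    moves += 1  # move the largest remaining disk from source to target
    moves += towers_of_hanoi(disks - 1, spare, target, source)
    return moves


def time_kernel_ms(kernel, *args):
    """Run one kernel and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = kernel(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


if __name__ == "__main__":
    moves, elapsed_ms = time_kernel_ms(towers_of_hanoi, 14)
    print(f"Towers: {moves} moves in {elapsed_ms:.3f} ms")
```

Each of the ten programs can be wrapped in the same way, which yields one execution time per algorithm rather than a single aggregate score.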
This kind of benchmark is very useful for exploring the behaviour of various architectures.

4.3. Paranoia

The Paranoia benchmark [18] was designed by William Kahan as a C program able to characterize the floating-point behaviour of a computer system. Paranoia performs the following tests:
Small integer operations.
Search for radix and precision.
Check whether rounding is done correctly.
Check for sticky bit.
Test whether $\sqrt{X^2} = X$ for a number of integers; if this passes, test monotonicity and whether the result is correctly rounded or chopped.
Testing powers $Z^i$ for small integers $Z$ and $i$.
Searching for the underflow threshold and the smallest positive number.
Testing powers $Z^Q$ at four nearly extreme values.
Searching for the overflow threshold and saturation.
Tries to compute 1/0 and 0/0.

5. Experimental Set-Up

Gaisler Research offers all the tools required to begin developing applications using the LEON processor, combined with a wide range of supported RTOSs such as eCos. In this section, we present the eCos configuration phases, the compilation of the benchmarks and their respective execution environments.

5.1. eCos configuration

The eCos RTOS installation and configuration can be divided into four steps. In our case we chose a Windows host, but a similar approach can be followed with Linux.

5.1.1. Environment installation

Since eCos and its associated tools require a Linux-like environment, we must begin by installing a Linux emulation environment. There are two candidates for this purpose: Cygwin and MinGW. We chose MinGW because it is lightweight. Note that the make, mpfr, sharutils, tcltk, wget, automake and gcc packages must be installed with the emulation environment.

5.1.2. Cross compiler installation

In order to produce LEON3 executables we must have a LEON cross compiler installed on our host. The recommended cross-compiler for eCos is sparc-elf-gcc. It is available on the Gaisler web site [19] and must be downloaded, decompressed in the /opt directory and added to the search path using the export PATH=/opt/sparc-elf mingw/bin:$PATH command.

5.1.3. Source code and configuration tool

In this stage we must install the eCos source code and its configuration tool. These resources are also available on the Gaisler web site [19]. The source code must be downloaded and decompressed in a chosen directory, which is then used in the configuration phase.
The configuration tool is available in several versions that differ in their software dependencies. In our case we used the native Windows version, which needs no additional libraries.
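Before moving on to the configuration itself, it can be useful to check that the host environment described above is complete. The sketch below is one possible sanity check and is not part of the Gaisler tooling: it looks up a few of the executables named in the text on the search path. The helper names are hypothetical, and the list covers only the packages that provide a command with the same name (mpfr, sharutils and tcltk are left out for that reason).

```python
import shutil

# Tool names taken from the text above; adjust to the actual installation.
REQUIRED_TOOLS = ["make", "wget", "automake", "gcc", "sparc-elf-gcc"]


def find_tools(tool_names, path=None):
    """Map each tool name to its resolved location, or None if not found."""
    return {name: shutil.which(name, path=path) for name in tool_names}


def missing_tools(tool_names, path=None):
    """Return the tool names that cannot be found on the search path."""
    located = find_tools(tool_names, path=path)
    return [name for name, location in located.items() if location is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If sparc-elf-gcc is reported as missing, the most likely cause is that the PATH update from the cross compiler installation step was not applied in the current shell.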

5.1.4. eCos configuration

eCos is one of the most architecture-independent RTOSs. The choice of a specific target is made using the configuration tool GUI, which offers a wide range of hardware platforms and software configurations. In this step we run the eCos configuration tool and select the LEON3 processor as the target platform, as shown in Fig. 3. In the same configuration window we can choose a predefined setting that can be customised later by selecting a specific networking stack, debugging interface or any other software component. In our case the default packages are sufficient.

Fig. 3. Hardware Platform and Used Package.

The build tool and Cygwin binary tool directories must be selected in the configuration tool as shown in Figs. 4 and 5. When using MinGW, these are located at C:\MinGW\msys\1.0\bin.

Fig. 4. eCos Repository.

Fig. 5. eCos Build Tools.

Finally, we save the configuration file and start the build process using the build item in the configuration tool GUI.

5.2. Benchmarks compilation

One of the most valuable tools offered by Gaisler Research is its Eclipse-based IDE. It offers various project creation wizards that generate project templates for an eCos shared library, static library or executable. We used this wizard, as shown in Figs. 6 and 7, to generate an eCos project template.

Fig. 6. eCos Project Generation.

Fig. 7. eCos Example Code Generation.

The ECOS_INSTALL_DIR environment variable must then be set in the project settings according to the previously chosen eCos install directory. To build the three benchmarks presented above, we created three projects in which we replaced the generated code with the benchmark source code. In the build configuration panel, the math library and eCos must be selected for use during the link step.

5.3. Benchmarks execution

Gaisler Research offers two simulation platforms. The first, tsim-leon3, is designed to simulate mono-processor LEON3-based architectures; the second, grsim-leon3, is designed to simulate bi-processor LEON3-based architectures. The two simulators have nearly the same usage syntax. We can then load and execute each benchmark on its respective architecture.

6. Performance Measurement and Analysis

In the following section we present the performance evaluation and its interpretation, following the flowchart shown in Fig. 8.

After preparing the environment (configuring and building eCos for the LEON3MMU architecture, then testing some example applications running under eCos and executing them from SDRAM), we built the three benchmarks for our two hardware configurations described in Table 1.

Table 1. Hardware Configuration.
                     Mono-processor Configuration    Bi-processor Configuration
SDRAM                16 Mbyte in 1 bank              16 Mbyte in 1 bank
ROM                  2048 Kbyte                      2048 Kbyte
Instruction cache    1*4 Kbytes, 16 bytes/line       1*4 Kbytes, 16 bytes/line
Data cache           1*4 Kbytes, 16 bytes/line       1*4 Kbytes, 16 bytes/line

Fig. 8. Performance Evaluation Flowchart. [Figure: flowchart running from Start through the setup of the evaluation environment; then, for each of the mono-processor and bi-processor configurations, integer performance evaluation, domain-specific performance evaluation and floating-point characterisation, leading to the mono-processor and bi-processor performance results; then the choice of the appropriate configuration; Stop.]

6.1. Obtained results using Dhrystone benchmark

After executing the Dhrystone benchmark on our platform simulators, we obtained the values reported in Fig. 9 and summarized in Table 2. These results show the performance of the studied architectures in terms of Dhrystone MIPS. The performance gain is about 33% with the bi-processor configuration.

Fig. 9. Performance in Dhrystone MIPS of the Two Architectures.

Table 2. Results Obtained with the Two Platform Simulators for Dhrystone Benchmark.
                              Mono-processor Architecture       Bi-processor Architecture
Cycles
Instructions
Overall CPI
CPU performance (50.0 MHz)    MOPS (31.66 MIPS, 0.00 MFLOPS)    MOPS (41.43 MIPS, 0.00 MFLOPS)

6.2. Obtained results using Stanford benchmark

After executing Stanford on our platform simulators, we collected the performance reports plotted in Figs. 10 and 11 and summarised in Table 3.

Fig. 10. Execution Time in Milliseconds of the Ten Algorithms Included in Stanford Benchmark.

Fig. 11. Composite Performance of the Two Architectures for Non-floating and Floating Point Applications.

These results show a small improvement with the multiprocessor architecture, especially for complex algorithms such as Puzzle and Intmm. This can be explained by the fact that the Stanford benchmark was not written to exploit parallel architectures. The composite performance report shows that the non-floating-point improvement obtained by adopting the multiprocessor is about 14%. On the other hand, the improvement for floating-point applications is about 55%. These results can be explained by the complexity of floating-point algorithms, which can exploit the multiprocessor architecture.

Table 3. Results Obtained with the Two Platform Simulators for Stanford Benchmark.
                              Mono-processor Architecture       Bi-processor Architecture
Cycles
Instructions
Overall CPI
CPU performance (50.0 MHz)    MOPS (29.49 MIPS, 0.59 MFLOPS)    MOPS (41.79 MIPS, 0.83 MFLOPS)
Cache hit rate                96.5 % (99.8 / 75.1)              93.5 % (99.8 / 60.2)
Simulated time                ms                                ms

After examining the simulator report we conclude that the bi-processor configuration gives a 12% performance gain for integer operations and 32% for floating-point operations. This gain is not equally distributed among the ten algorithms included in the Stanford benchmark. The choice between the two architectures will depend on the final application.
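The arithmetic behind such comparisons can be sketched from the figures that survive in Tables 2 and 3. The helper names below are hypothetical. The sketch also assumes that the MIPS values are aggregate simulator throughput at the reported 50.0 MHz clock, so the ratios it prints describe overall instruction throughput and need not coincide with the benchmark-level gains quoted in the text, which are read from the benchmarks' own reports.

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def effective_cpi(clock_mhz, mips):
    """Clock cycles per instruction implied by a clock rate and a MIPS rate."""
    if mips <= 0:
        raise ValueError("mips must be positive")
    return clock_mhz / mips


# Simulator-reported figures from Tables 2 and 3 (50.0 MHz clock).
CLOCK_MHZ = 50.0
REPORTS = {
    "Dhrystone": {"mono_mips": 31.66, "bi_mips": 41.43},
    "Stanford": {"mono_mips": 29.49, "bi_mips": 41.79},
}

if __name__ == "__main__":
    for name, report in REPORTS.items():
        gain = relative_gain_percent(report["mono_mips"], report["bi_mips"])
        cpi = effective_cpi(CLOCK_MHZ, report["mono_mips"])
        print(f"{name}: MIPS gain {gain:.1f} %, mono-processor CPI {cpi:.2f}")
```

For Dhrystone the MIPS ratio gives a gain of roughly 31%, in the same range as the approximately 33% read from Fig. 9.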

6.3. Results obtained using Paranoia benchmark

After executing the Paranoia benchmark on the two platform simulators, we conclude that FPU operations are executed correctly on both architectures. But the benchmark reports that we have: This type of failure is not dangerous for system functionality but can cause some loss of precision. The source of this failure is certainly the code generated with the soft-float parameters of the GCC compiler.

6.4. Analysis of the obtained results

The results show a performance improvement ranging from 0 to 55% with multicore architectures. This is mainly because neither the benchmarks nor the compiler are optimized for multicore architectures. The largest improvements are obtained with complex algorithms such as Puzzle or Mm. This fact must be considered during algorithm development in order to exploit multicore capabilities adequately.

7. Conclusion and Perspectives

The reported benchmark results cover three performance fields of computing systems. Dhrystone is used to compare integer unit performance. Stanford compares the execution performance of different standard algorithms in both integer and floating-point computing. Paranoia characterizes floating-point operations. We observed that the performance gain of the multicore architecture is modest, since the benchmarks used were not designed to exploit it. The same approach can be used to compare the performance of other architectures, but this kind of work must be done carefully, since a few studies report some fragilities of SPEC CPU95 and CPU2000 [20], which are a superset of the benchmarks we used. In our work we focused on evaluating hardware execution speed in the embedded system design flow. Modern embedded systems have other performance aspects and factors that must be considered, such as multiprocessor impact and RTOS overhead [21].
This work can be extended by multithreading these benchmarks or by using multiprocessor benchmarks to compare the studied architectures [22]. In our future work we will focus on multiprocessor and operating system overhead. Multiprocessor performance evaluation can be done using dedicated multicore and parallel benchmarks such as SPLASH2 and NPB, or by adapting standard benchmarks [22, 23]. Real-time overhead and RTOS comparison, on the other hand, can be evaluated using standard benchmarks that measure the overall performance after the adoption of a specific operating system, or by adopting dedicated benchmarks such as Thread-Metric [24].

For this reason we intend to extend our study by measuring the performance of eCos and comparing it with other RTOSs such as RTEMS, µC/OS and VxWorks on the same platform.

Acknowledgment

I would like to thank the National Institute of Applied Science and Technologies (INSAT) for its continued and positive support for scientific research, and the anonymous reviewers for their valuable comments.

References

1. Peddersen, J.; Shee, S.L.; Janapsatya, A.; and Parameswaran, S. (2005). Rapid embedded HW/SW system generation. Proceedings of the 18th International Conference on VLSI Design.
2. Sheldon, D.; Kumar, R.; Vahid, F.; Tullsen, D.; and Lysecky, R. (2006). Conjoining soft-core FPGA processors. IEEE/ACM International Conference on Computer-Aided Design, ICCAD '06.
3. L'Hours, L. (2005). Generating efficient custom FPGA soft-cores for control-dominated applications. Proceedings of the 16th International Conference on Application-Specific Systems, Architecture and Processors (IEEE ASAP'05).
4. Huerta, P.; Castillo, J.; Martinez, J.I.; and Pedraza, C. (2007). Exploring FPGA capabilities for building symmetric multiprocessor systems. SPL, 3rd Southern Conference on Programmable Logic.
5. Großschädl, J.; Tillich, S.; and Szekely, A. (2007). Performance evaluation of instruction set extensions for long integer modular arithmetic on a SPARC V8 processor. DSD, Euromicro Conference on Digital System Design Architectures, Methods and Tools.
6. John, L.K.; and Eeckhout, L. (2006). Performance evaluation and benchmarking. CRC Press.
7. Wolf, W. (2007). High-performance embedded computing. Architecture, Applications, and Methodologies. Elsevier Inc.
8. Guthaus, M.R.; Ringenberg, J.S.; Ernst, D.; Austin, T.M.; Mudge, T.; and Brown, R.B. (2001). MiBench: A free, commercially representative embedded benchmark suite. WWC, IEEE International Workshop on Workload Characterization.
9. Ahmed, S.Z.; Eydoux, J.; Rougé, L.; Cuelle, J.-B.; Sassatelli, G.; and Torres, L. (2009). Exploration of power reduction and performance enhancement in LEON3 processor with ESL reprogrammable eFPGA in processor pipeline and as a co-processor. Design, Automation & Test in Europe Conference & Exhibition.
10. Gaisler, J. (2002). A portable and fault-tolerant microprocessor based on the SPARC V8 architecture. Proceedings of International Conference on Dependable Systems and Networks.
11. Massa, A.J. (2003). Embedded software development with eCos. Prentice Hall.
12. OpenCores.
13. RTEMS Real Time Operating System.
14. Yaghmour, K.; Masters, J.; Ben-Yossef, G.; and Gerum, P. (2003). Building embedded Linux systems. (2nd Ed.) O'Reilly Media.
15. Stitt, G.; Lysecky, R.; and Vahid, F. (2003). Dynamic hardware/software partitioning: A first approach. Proceedings of the 40th Design Automation Conference (DAC).
16. Henning, J.L. (2000). SPEC CPU2000: measuring CPU performance in the new millennium. Computer, 33(7).
17. Assmann, U. (1996). How to uniformly specify program analysis and transformations with graph rewrite systems. In Proceedings of the CC'96. 6th International Conference on Compiler Construction.
18. Hillesland, K.E.; and Lastra, A. (2004). GPU floating point paranoia. In: Proceedings of 1st ACM Workshop General-Purpose Computing on Graphics Processors (GP2 04).
19. Cross-Compiler for eCos.
20. Vandierendonck, H.; and De Bosschere, K. (2004). Eccentric and fragile benchmarks. International Symposium on ISPASS, Performance Analysis of Systems.
21. Proctor, F.M.; and Shackleford, W.P. (2001). Real-time operating system timing jitter and its impact on motor control. Proceedings of the SPIE Conference on Sensors and Controls for Intelligent Manufacturing II, 4563.
22. Joshi, A.M.; Eeckhout, L.; and John, L.K. (2007). Exploring the application behavior space using parameterized synthetic benchmarks. 16th International Conference on Parallel Architecture and Compilation Techniques, PACT 2007.
23. Jerraya, A.A.; Yoo, S.; Wehn, N.; and Verkest, D. (2003). Embedded software for SOC. Springer.
24. Fletcher, B.H. (2005). FPGA embedded processors revealing true system performance. Embedded Systems Conference, San Francisco.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No.2, February 2011

Building XenoBuntu Linux Distribution for Teaching and Prototyping Real-Time Operating Systems

Nabil LITAYEM, Ahmed BEN ACHBALLAH, Slim BEN SAOUD
Department of Electrical Engineering - INSAT, University of Carthage, TUNISIA
{nabil.litayem, ahmed.achballah, slim.bensaoud}@gmail.com

Abstract- This paper describes the realization of a new Linux distribution based on Ubuntu Linux and the Xenomai real-time framework. This realization is motivated by the pressing need for real-time systems in modern computer science courses. Most of the technical choices were made after a qualitative comparison. The main goal of this distribution is to offer a standard operating system (OS) that includes the Xenomai infrastructure and the essential tools to begin hard real-time application development inside a user-friendly desktop environment. The released live/installable DVD can be used to emulate several classic RTOS Application Program Interfaces (APIs), to use and understand real-time Linux directly in a user-friendly desktop environment, and to prototype real-time embedded applications.

Keywords- Real-time systems, Linux, Remastering, RTOS API, Xenomai

I. INTRODUCTION

Real-time embedded software has become an important part of the information technology market. This kind of technology, previously reserved for a very small set of mission-critical applications such as spacecraft and avionics, is now present in most current electronic devices such as cell phones, PDAs, sensor nodes and other embedded control systems [1]. These facts make it very important to familiarize graduate students with embedded real-time operating systems [2]. However, many academic computer science programs focus on PC-based courses with proprietary operating systems.
This may be suitable for professional training, but it is inappropriate for academia because it limits students to these proprietary solutions. In the RTOS market, there are a few predominant actors with industry-adopted standards. Academic real-time systems courses must offer students the opportunity to use and understand the most common RTOS APIs. Currently, we are witnessing growing interest in real-time Linux extensions. They deserve close attention, since each real-time Linux extension offers a set of advantages [3]. The main advantages of the Xenomai real-time Linux extension are its ability to emulate standard RTOS interfaces and its compatibility with non-real-time Linux. Such adoption can be very cost-effective for the overall system. In this paper, we present the benefits of using Xenomai and Ubuntu as a live installable DVD for teaching real-time operating systems and for rapid prototyping of real-time applications. The technical choices and the benefits of the chosen solutions are discussed. The remainder of this paper is organized as follows. Section 2 presents a survey of the RTOS market and discusses both the classic solutions and the Linux-based alternatives. The remastering solutions and available tools are detailed in Section 3. Section 4 describes the realization of our live DVD. Conclusions and discussion are provided in Section 5.

II. SURVEY OF THE RTOS MARKET

A. Classic RTOS and Real-Time API

An RTOS is an essential building block of many embedded systems. The most basic purpose of an RTOS is to offer task deadline handling in addition to classic operating system functionalities. The RTOS market is shared among a few actors. Each of them has its own development tools, supported targets, compiler tool-chain and RTOS APIs. In addition, several RTOS vendors offer additional services such as protocol stacks and application domain certification.
Given the wide variety of RTOSs, designers must choose the one most suitable for their application domain. In the following, we briefly describe the traditional RTOSs and real-time APIs available in the embedded market.

1) VxWorks
VxWorks [4] is an RTOS made and sold by Wind River Systems, which is now owned by Intel. It was primarily designed for use in embedded systems. VxWorks continues to be considered the reference RTOS due to its wide range of supported targets and the quality of its associated IDE.

2) PSOS
This RTOS [5] was created in about It was widely adopted, especially for Motorola MCUs. PSOS has belonged to Wind River Systems since 1999.

3) VRTX
VRTX [6] is an RTOS suitable for both traditional board-based embedded systems and System on Chip (SoC) designs. It was widely adopted for RISC microprocessors.

4) POSIX
POSIX [7] (Portable Operating System Interface for Computer Environments) is a set of standardized interfaces that provides source-level compatibility for RTOS services.

B. Real Time Linux Alternatives

According to the reference study [4], embedded Linux holds a growing share of the market. In 2007 its share was about 47% of the total embedded market. The same study anticipates that the market share of embedded Linux will be 70% in These facts can be explained by the growing resources available in modern embedded hardware, the maturity of current Linux kernels and applications, and the need for cost reduction. Currently, there are many open source implementations of real-time extensions for the Linux kernel, and various industrial solutions are based on those extensions with the added value of quality support. Real-time Linux variants are now used successfully in different applications [5]. Due to the increasing popularity of Linux in the embedded systems field, much effort has been spent on transforming the Linux kernel into a real-time solution. This work has resulted in several implementations of real-time Linux [6]. They can be classified into two categories according to the approach used to improve the real-time performance of the Linux kernel. The first approach consists of modifying the kernel behavior to improve its real-time characteristics. The second approach consists of using a small real-time kernel to handle real-time tasks, which runs the Linux kernel as a low-priority task.
Currently, much research and industrial effort is devoted to enhancing the real-time capabilities of the various real-time Linux flavors [3] for different perspectives and application domains. This work can be classified into two categories. The first concerns scheduling algorithms and timer management. The second concerns application domains such as Hardware-in-the-Loop simulation systems, model-based engineering [7] and real-time simulation. In Table I, we present some of the many available open source and research Linux implementations.

TABLE I. LINUX OPEN SOURCE RTOSS

Linux-based RTOS: Description
ADEOS: Adaptive Domain Environment for Operating Systems [11] is a GPL nanokernel hardware abstraction layer created to provide a flexible environment for sharing hardware resources among many operating systems. ADEOS enables multiple prioritized domains to exist simultaneously on the same hardware.
ART Linux: Advanced Real-Time Linux [12] is a hard real-time Linux extension inspired by RTLinux and developed with robotics applications in mind. Real-time capability is accessible from user level and does not require special device drivers. ART Linux is available for the 2.2 and 2.6 Linux kernels.
KURT: Kansas University's Real-Time Linux [13] is a real-time Linux extension developed by Kansas University for x86 platforms. It allows events to be scheduled with a 10 µs resolution.
QLinux: The QLinux [14] real-time Linux kernel is a Linux extension that focuses on providing Quality of Service (QoS) guarantees for "soft real-time" performance in applications such as multimedia, data collection, etc.
Linux/RK: Linux Resource Kernel [15] incorporates real-time extensions into the Linux kernel.
RTAI: Real-Time Application Interface [16] is usable on both mono-processors and symmetric multi-processors (SMPs) and allows the use of Linux in many "hard real-time" applications. RTAI is the real-time Linux that has the best integration with other open source tools such as Scilab/Scicos and Comedi. This extension is widely used in control applications.
Xenomai: Xenomai [17] is a real-time development framework that provides hard real-time support for GNU/Linux. It implements the ADEOS (I-Pipe) micro-kernel between the hardware and the Linux kernel. I-Pipe is responsible for executing real-time tasks and intercepts interrupts, blocking them from reaching the Linux kernel to prevent the preemption of real-time tasks by the Linux kernel. Xenomai provides real-time interfaces either to kernel-space modules or to user-space applications. These include RTOS interfaces (pSOS+, VRTX, VxWorks and RTAI), standardized interfaces (POSIX, µITRON) and new interfaces designed with the help of RTAI (native interface). These features have led Xenomai to be considered the RTOS chameleon for Linux. It was designed to enable smooth migration from traditional RTOSs to Linux without having to rewrite the entire application.
RT-Preempt: The RT-Preempt patch [18] converts Linux into a fully preemptible kernel. It allows nearly the entire kernel to be preempted, except for a few very small regions of code. This is done by replacing most kernel spinlocks with preemptible mutexes that support priority inheritance, and by moving all interrupt handlers to kernel threads (dubbed interrupt threading), which gives them their own context and allows them, among other things, to sleep.

C. Selecting real-time extension for educational purposes

Xenomai, RTAI and RT-Preempt are the most widely used real-time Linux extensions. According to the study [8], Xenomai and RTAI offer performance comparable to that of VxWorks in hard real-time applications. RTAI has the best integration with open source tools and is well suited to teaching control applications. RT-Preempt has the advantage of being integrated into the mainline kernel.
It supports all the drivers integrated into the standard kernel. Xenomai can emulate classic RTOS APIs with good real-time characteristics. It can also be fully compatible with RTAI. For these reasons, we chose Xenomai as the primary extension to integrate in our solution.
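The idea of emulating several classic RTOS APIs on top of one real-time core can be illustrated with a small toy model. The sketch below is not Xenomai code: the class and method names are hypothetical, and the two priority conventions (0 as the most urgent level in the VxWorks-like skin, larger numbers as more urgent in the pSOS-like skin) are assumptions made for the sake of the example.

```python
class Nucleus:
    """Toy stand-in for a generic real-time core shared by all skins."""

    def __init__(self):
        self.threads = []

    def create_thread(self, name, priority, entry):
        # Nucleus convention in this sketch: larger number = more urgent.
        thread = {"name": name, "priority": priority, "entry": entry}
        self.threads.append(thread)
        return thread

    def run_all(self):
        """Run every thread once, most urgent first; return the outputs."""
        ordered = sorted(self.threads, key=lambda t: t["priority"], reverse=True)
        return [(t["name"], t["entry"]()) for t in ordered]


class VxWorksLikeSkin:
    """Hypothetical skin: priority 0 is the most urgent, 255 the least."""

    def __init__(self, nucleus):
        self.nucleus = nucleus

    def task_spawn(self, name, priority, entry):
        return self.nucleus.create_thread(name, 255 - priority, entry)


class PsosLikeSkin:
    """Hypothetical skin: larger priority numbers are more urgent."""

    def __init__(self, nucleus):
        self.nucleus = nucleus

    def t_create(self, name, priority, entry):
        return self.nucleus.create_thread(name, priority, entry)


if __name__ == "__main__":
    nucleus = Nucleus()
    VxWorksLikeSkin(nucleus).task_spawn("control_loop", 10, lambda: "control")
    PsosLikeSkin(nucleus).t_create("logger", 50, lambda: "log")
    print(nucleus.run_all())
```

Each skin only translates its own calling convention into the common one, which is why an application written against one API can share the same core with applications written against another.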

1) Xenomai technology and ADEOS

To make Xenomai tasks hard real-time in GNU/Linux, a real-time application interface (RTAI) co-kernel is used. It allows real-time tasks to run seamlessly alongside the hosting GNU/Linux system, while the tasks of the regular Linux kernel are seen as running in a low-priority mode. This cohabitation is achieved using the previously presented ADEOS nanokernel and is illustrated in Figure 1.

Figure 1. Concurrent access to hardware using ADEOS

Based on the behavioral similarities between traditional RTOSs, Xenomai technology aims to provide a consistent, architecture-neutral and generic emulation layer that takes advantage of these similarities. This emulation helps to bridge the gap between the very fragmented RTOS world and the GNU/Linux world.

Figure 2. Xenomai Architecture [Figure: layered diagram showing user-space and kernel-based applications on top of the API skins (native, POSIX, VxWorks, pSOS, µITRON, VRTX), the Xenomai real-time nucleus, the HAL, Adeos/I-Pipe and the hardware, with the system call interface marking the boundary between user space and kernel space.]

Xenomai relies on the features and behaviors [9] shared by many traditional embedded RTOSs, especially from the thread scheduling and synchronization standpoints. These similarities are used to implement a nucleus that offers a limited set of common RTOS behaviors. This behavior is exposed through services grouped in high-level interfaces, which can in turn be used to implement emulation modules of real-time application programming interfaces. These interfaces mimic the corresponding real-time kernel APIs. Xenomai technology thus offers a smooth and comfortable way to migrate real-time applications from traditional RTOSs to GNU/Linux.
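The cohabitation between a real-time domain and Linux can be pictured as a prioritized interrupt pipeline. The following toy model is a simplified sketch and does not reproduce the ADEOS implementation: the class and method names are hypothetical, and it assumes that every interrupt is offered to each domain in priority order and that a domain which has masked its interrupts (marked here as stalled) receives them later.

```python
class Domain:
    """One operating-system domain attached to the interrupt pipeline."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority   # larger number = served earlier
        self.stalled = False       # a stalled domain defers its interrupts
        self.pending = []          # interrupts logged while stalled
        self.handled = []          # interrupts actually delivered


class InterruptPipeline:
    """Toy model of a prioritized interrupt pipeline between domains."""

    def __init__(self, domains):
        self.domains = sorted(domains, key=lambda d: d.priority, reverse=True)
        self.trace = []            # global delivery order, for inspection

    def raise_irq(self, irq):
        # Offer the interrupt to every domain, most urgent domain first.
        for domain in self.domains:
            if domain.stalled:
                domain.pending.append(irq)
            else:
                self._deliver(domain, irq)

    def unstall(self, domain):
        # Replay, in arrival order, what was deferred while stalled.
        domain.stalled = False
        while domain.pending:
            self._deliver(domain, domain.pending.pop(0))

    def _deliver(self, domain, irq):
        domain.handled.append(irq)
        self.trace.append((domain.name, irq))


if __name__ == "__main__":
    realtime = Domain("xenomai", priority=2)
    linux = Domain("linux", priority=1)
    pipeline = InterruptPipeline([linux, realtime])
    linux.stalled = True           # Linux has masked its interrupts
    pipeline.raise_irq("timer")
    pipeline.unstall(linux)
    print(pipeline.trace)
```

In this model the real-time domain always sees the interrupt first, and masking interrupts on the Linux side never delays it, which is the property that lets regular Linux activity appear as a low-priority task.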
2) RTOS emulation in the industrial field
The fact that Xenomai offers real-time capabilities in a standard desktop environment can be very useful in control system prototyping [10]. In this case, the desktop system running Xenomai can be used in an X-in-the-loop setup to emulate the controlled equipment (electrical motors, power plants, etc.) in the different phases of product prototyping and testing. Since Xenomai can emulate most classic RTOS APIs, an application developed for one of these RTOSs can easily be ported to it. Furthermore, it is covered by an open-source license, which is a significant cost advantage. By using Xenomai, we can migrate to open-source solutions without having to rewrite RT applications previously developed for a proprietary RTOS. Xenomai can also reduce application cost by letting real-time applications cohabit with standard time-shared Linux applications, thereby benefiting from the whole Linux software infrastructure combined with RT capability.

3) RTOS emulation in the academic field
Real-time students must have a clear idea of the various RTOS APIs, but buying a large collection of classic RTOSs for educational use is not feasible. Xenomai offers the capability of using different RTOS APIs, understanding the abstraction concepts behind them, manipulating the kernel and user spaces, and learning about virtualization technologies.

III. REMASTERING UBUNTU LINUX

A. Interest of remastering Linux distributions
A live CD or DVD allows any user to run a different OS or applications without having to install them on the computer. To build a live CD/DVD, we must remaster an existing OS. Remastering is the process of customizing a software distribution. It is particularly associated with the Linux distribution world, but it has been extended to most widely used OSs. Notably, most Linux distributions started as a remaster of another distribution.
The term was popularized by Klaus Knopper, creator of the Knoppix live distribution, who has traditionally encouraged users to modify his distribution to suit their needs. Remastering an OS can be used to make a full system backup, including personal data, to a live or installable CD, DVD or flash disk that is usable and installable anywhere. It can also be used to make a distributable copy of an installed and customized operating system.

B. Existing remastering software for Ubuntu Linux
There are many remastering solutions for the various Linux distributions. Our live/installable DVD is based on Ubuntu

because this OS has gained a growing place in different application areas. The best-known remastering solutions include Remastersys, Ubuntu Customization Kit, Reconstructor, Builder, ulc-livecd-editor and Disc Remastering Utility. Reconstructor and Ubuntu Customization Kit can build a personalized live system based on an official image, but this approach is relatively complicated. The other tools focus on package installation and boot customization. We adopted Remastersys since it is the most useful and powerful tool we found among the available remastering solutions.

IV. THE INTEGRATION OF XENOMAI IN UBUNTU INFRASTRUCTURE
Xenomai depends only on the Linux kernel version; it is independent of the Linux distribution on which it runs. The recent Ubuntu distributions integrate Xenomai as a default package. We chose Ubuntu as the base distribution because it inherits all the benefits of a Debian distribution in terms of reliability and number of available packages. Ubuntu also has the best existing multi-language support. Many computer manufacturers offer Linux as an alternative operating system. This type of system can be used as a framework for Model-Driven Engineering (MDE) in control and automation, since using a standard operating system such as Ubuntu facilitates the integration of these tools. Our live DVD was built following the steps below.

A. Adding Xenomai functionality to Linux kernel
In this step, we must first download the essential packages needed to configure and compile the Linux kernel: build-essential, kernel-package and ncurses-dev. They can be installed using synaptic or apt-get. Second, we must download both the Linux kernel and its compatible Xenomai framework, patch the Linux kernel using the prepare-kernel tool included in the Xenomai package, configure and compile it, and add this kernel to the boot choices.
For the actual release, we used the xenomai and the Linux. The compilation and installation should preferably be done with the make-kpkg tool, designed especially for Debian-based distributions. After completing these steps, we can boot a system running a Linux kernel with ADEOS.

B. Compiling Xenomai and running some samples
The second step begins by creating a Xenomai group and adding the appropriate users to it (XenoBuntu and root). We can then configure, compile and install Xenomai and its examples, and customize the available software by adding development environments. We adopted CodeLite, an Integrated Development Environment (IDE) designed for C and C++ development, and Scilab/Scicos, which can be used for control system prototyping and real-time code generation. After rebooting, we obtain a usable system based on Xenomai, on which we can test real-time examples based on the different standard APIs.

C. Transforming our running system into a live DVD
The final step is to remaster our running real-time system using the Remastersys package. Remastersys is a Debian-oriented remastering tool that can be controlled from the command line or a graphical user interface. It enables the creation of live installable CDs or DVDs that include all the software available in the installed system. We can choose whether to include our personal data by selecting either the dist or the backup parameter. We must add the Remastersys repository, install the package and use it to remaster our system in order to obtain a burnable .iso image usable as a live installable DVD. This phase is the easiest step of the realization, thanks to the simplicity of Remastersys and the wide choice of parameters it offers.

D. Testing the real-time characteristics of our system
The resulting system can be used as a live DVD or installed on a standard PC architecture. The real-time performance may vary depending on the architecture used. To give a clear idea of the performance reached on the deployment platform, Xenomai offers a set of benchmarks that test different real-time aspects of the system. The most important benchmarks are described in Table II.

TABLE II. XENOMAI ASSOCIATED BENCHMARKS
Benchmark: Description
Switchtest: Can test thread context switches.
Switchbench: Can measure the context switch latency between two real-time tasks.
Cyclictest: Can be used to compare the configured timer expiration with the actual expiry time.
Clocktest: Can be used to repeatedly print a time offset compared to the reference gettimeofday().

These benchmarks can be used to familiarize students with real-time performance evaluation and its associated metrics. This can be illustrated by evaluating the impact of real-time enhancements on overall system performance.

V. CONCLUSION
The realized live/installable DVD can be used both in education and in system development. The main contribution of this solution is a ready-to-run system, which minimizes the time spent selecting and including the various software components needed. This system can be enhanced and remastered after its installation, and it can be tuned by including new components to meet specific application needs.

This kind of solution makes it possible to work with real-time Linux without losing contact with classic RTOS knowledge. It can be a very interesting way to introduce the real-time and embedded Linux world, especially considering that Xenomai is currently used by various companies, such as Sysgo in its ELinOS solution.

Since this distribution does not take advantage of the two other predominant real-time Linux extensions (RTAI and RT-PREEMPT), we plan to extend it with these two extensions by including a multi-configuration boot capability that lets the user choose between the three alternatives. In addition, we plan to include and explore other open-source components that can be used for real-time application design and code generation, such as Topcased, Openembedd and Beremiz.

REFERENCES
[1],, 2006
[2] J. Ganssle, The Art of Designing Embedded Systems.
[3] G. Sudha, et al., "Enhancing Student Learning with Hands-On RTOS Development in Real-Time Systems Course," in 38th ASEE/IEEE Frontiers in Education Conference, 2008, pp. S2H-11 - S2H-16.
[4] N. Vun, et al., "Real-time Enhancements for Embedded Linux," in IEEE International Conference on Parallel and Distributed Systems, 2008.
[5] Wind River, VxWorks Program guider, Tsinghua University Press, August 2003.
[6] L. Bacon, E. Becquet, E. Gressier-Soudan, C. Lizz, "Provisioning QoS in Real-Time Distributed Object Architectures for Power Plant Control Applications," April 30th 2000.
[7] J. Ready, "VRTX: A Real-Time Operating System for Embedded Microprocessor Applications," IEEE Micro, pp. 8-17, Aug.
[8]-[12] M. A. Rivas and M. G. Harbour, "Evaluation of new POSIX real-time operating systems services for small embedded platforms," in Proceedings of the 15th Euromicro Conference on Real-Time Systems, Porto, Portugal, July.
[13] Z. Davis, "Snapshot of the embedded Linux market," Available:
[14] S. Kume, Y. Kanamiya, D. Sato, "Towards an open-source integrated development and real-time control platform for robots," IEEE International Conference on Robotics and Biomimetics.
[15] M. T. Jones, "Anatomy of real-time Linux architectures: From soft to hard real-time," IBM.
[16] G. Doukas and K. Thramboulidis, "A Real-Time Linux Based Framework for Model-Driven Engineering in Control and Automation."
[17] A. Barbalace, et al., "Performance Comparison of VxWorks, Linux, RTAI, and Xenomai in a Hard Real-Time Application," IEEE Transactions on Nuclear Science.
[18] K. Yaghmour, et al., Building Embedded Linux Systems, Andy Oram ed.: O'Reilly.
[19] B. W. Choi, et al., "Real-time control architecture using Xenomai for intelligent service robots in USN environments," Intelligent Service Robotics.
[20] Z. Chen, X. Luo, Z. Zhang, "Research Reform on Embedded Linux's Hard Real-time Capability in Application," Embedded Software and Systems Symposia, ICESS Symposia '08, International Conference on, July 2008.
[21] S. Dietrich and D. Walker, "The evolution of real-time Linux," in Proc. 7th Real-Time Linux Workshop.
[22] B. Srinivasan, S. Pather, R. Hill, F. Ansari, and D. Niehaus, "A Firm Real-Time System Implementation Using Commercial Off The Shelf Hardware and Free Software," IEEE Real-Time Technology and Applications Symposium, June.
[23] V. Sundaram, A. Chandra, P. Goyal, P. Shenoy, J. Sahni, and H. Vin, "Application Performance in the QLinux Multimedia Operating System," in Proceedings of the Eighth ACM Conference on Multimedia, Los Angeles, CA, November.
[24] S. Oikawa and R. Rajkumar, "Linux/RK: A portable resource kernel in Linux," in Proceedings of the IEEE Real-Time Systems Symposium Work-In-Progress, Madrid, December 1998.
[25] D. Beal, E. Bianchi, L. Dozio, S. Hughes, P. Mantegazza, and S. Papacharalambous, "RTAI: Real-Time Application Interface," Linux Journal.
[26] P. Gerum, "Xenomai - Implementing a RTOS emulation framework on GNU/Linux," Whitepaper.
[27] A. Siro, C. Emde, and N. McGuire, "Assessment of the realtime preemption patches (RT-Preempt) and their impact on the general purpose performance of the system," in Proceedings of the 9th Real-Time Linux Workshop.

AUTHORS PROFILE
N. LITAYEM received the Dipl.Ing. and M.S. degrees in electrical engineering from National School of Engineer of Sfax (ENIS), Tunisia, in 2005 and 2006, respectively. He received the M.S. degree in Embedded Systems Engineering from the National Institute of Applied Science and Technologies, Tunisia. He is currently a Ph.D. student with the Laboratoire d'Etude et de Commande Automatique de Processus (LECAP) at the University of Carthage (INSAT-EPT). His research interests are the reliable control of electrical drives using FPGA technologies.

A. BEN ACHBALLAH received the BSc degree in Electronics from Bizerte's Faculty of Sciences in 2007 and the MSc degree in Instrumentation and Measure from the National Institute of Applied Sciences and Technology of Tunis (INSAT). He is currently a PhD student with the Laboratoire d'Etude et de Commande Automatique de Processus (LECAP) at the Polytechnic School of Tunisia (EPT). His research interests include FPGA-based simulators for embedded control applications, simulation methodologies for networks-on-chip and high-level synthesis techniques.

S. BEN SAOUD (1969) received the electrical engineer degree from the High National School of Electrical Engineering of Toulouse/France (ENSEEIHT) in 1993 and the PhD degree from the National Polytechnic Institute of Toulouse (INPT). He joined the department of Electrical Engineering at the National Institute of Applied Sciences and Technology of Tunis (INSAT) in 1997 as an Assistant Professor. He is now Professor and the Leader of the Embedded Systems Design Group at INSAT - University of Carthage. His research interests include embedded systems architectures, real-time solutions and applications to the co-design of digital control systems and SpaceWire modules.

International Journal of Computer Applications ( ) Volume 17 No.3, March 2011

Impact of the Linux Real-time Enhancements on the System Performances for Multi-core Intel Architectures

Nabil Litayem, Technologue Professor, ISET-Bizerete, Menzel Abderahmène 7035
Slim Ben Saoud, Conference Professor, LECAP, EPT-INSAT, Centre urbain nord, BP 676 Tunis cedex, Tunisia

ABSTRACT
Embedded Linux has become a dominant choice in embedded entertainment and mobile systems. Its adoption in widely used control applications is the second phase of its domination of the embedded market. One of the most important criteria for a control RTOS is its determinism/overhead ratio. Currently, many extensions exist to bring real-time capability to the Linux kernel. At the same time, standard computer architectures have become widely adopted in the embedded market, with a large variety of performance levels and power requirements. In this paper, we study the timing enhancements offered by various real-time Linux kernel extensions and their impact on overall system performance. The results are compared with the performance of the standard and server kernels. We used a multi-core Intel-based architecture for our study, given the trend of the embedded control market toward this kind of architecture. We studied two metrics to reflect the performance of each kernel: latency and throughput. Such work can guide the choice of a real-time Linux extension for a given hardware architecture in order to meet control application requirements.

General Terms
Real-time, Benchmark, Linux

Keywords
Real-time Linux, Xenomai, LowLatency, PREEMPT-RT, control applications

1. INTRODUCTION
Modern control applications require many newer functionalities, such as a GUI (Graphical User Interface), communication capabilities and high software reusability.
Traditional RTOSs (Real-Time Operating Systems) can reach the required timing performance but have many weaknesses with respect to these newer requirements. Various traditional RTOSs have tried to enhance their functionality by offering additional software components through additional, costly licenses. The standard Linux kernel, on the other hand, meets these other requirements but cannot be used as a hard real-time operating system. These reasons prompted several initiatives to integrate real-time capabilities into the Linux kernel, which can make Linux a very serious candidate in the embedded systems field. Currently, many approaches are available that offer these functionalities using different architectures [1], [2]. The most widely adopted solutions are RTLinux/RTCore, RTAI, Xenomai and the PREEMPT-RT [2] patch. Each of these real-time enhanced kernels has its own internal architecture, strengths and weaknesses. The wide choice of timing performance and functionality among the different Linux kernel variants makes Linux one of the most suitable embedded operating systems, widely adopted for embedded applications with different ranges of constraints. Parts of the PREEMPT-RT patch are progressively being merged into the mainline kernel, and the patch is used by major real-time actors such as Wind River in their Linux4 solution. Xenomai [3] is another successful real-time project, widely adopted in hard real-time applications. This extension is currently adopted by the Sysgo company in its real-time Linux solution called ELinOS. Moreover, a few works try to merge Xenomai with PREEMPT-RT in a solution named Xenomai/Solo, which ports Xenomai capabilities to the PREEMPT-RT patch. On the other hand, embedded processing requirements are increasing at an exponential rate, and the supply of embedded processors is becoming increasingly broad.
Different platforms can be adopted and used in the embedded field; classically, FPGA and DSP architectures are widely adopted in the high-performance embedded field. We are currently witnessing the convergence of PC and embedded architectures. Conventional microprocessor vendors are trying to broaden their activities with processors that can be used in both standard and embedded computers. Intel and AMD, with their Atom(TM) and Geode(TM) processors respectively, are considered interesting candidates in the embedded field. Other high-end processors designed for desktop and server systems are being adopted in industrial computers designed by major embedded control actors such as Siemens and National Instruments. These processors are used by various manufacturers in industrial control or for hardware-in-the-loop applications. These kinds of architecture are often adapted to offer special, robust peripherals enhanced to work in industrial environments.

In this paper, we investigate the usability of industrial computers for real-time control applications using various real-time Linux extensions. To this end, we evaluate the real-time performance and throughput of the Xenomai, PREEMPT-RT, Lowlatency, standard and server kernels. Timing performance is measured using Cyclictest, and UnixBench is used for throughput evaluation. Our timing performance tests are run under a workload generated using the hackbench benchmark. These

performance evaluation tools were adopted after a qualitative comparison of various tools. This approach is adopted to study the timing performance of Linux and its impact on overall system performance for both single-core and dual-core systems. The studied platform is a standard computer with an Intel Core(TM) 2 Duo microprocessor and 3 GB of DDR2 RAM, based on Ubuntu Linux. Such a platform is similar to high-end industrial computers designed for control purposes.

The rest of this paper is organized as follows. Section 2 presents a survey of the dominant open-source real-time Linux solutions. A qualitative comparison of performance evaluation tools is presented in Section 3. In Section 4, we study the latency of the different kernel versions using the cyclictest time measurement program. The system performance evaluation is presented in Section 5 for both single-core and dual-core systems. Conclusions and discussion are given in Section 6.

2. REAL-TIME LINUX EXTENSIONS
A computer system can be considered a real-time system if time is a dimension of its correctness. The most important aspect of such a system is meeting deadlines. Linux is a general-purpose operating system designed for desktop and server usage. Its kernel was originally designed to guarantee the best resource allocation for all executed processes. Desktop and server Linux kernels use the CFS (Completely Fair Scheduler). This scheduler is not suited to real-time systems, since these are characterized by unfairness. The successful adoption of Linux in these two fields pushed various embedded system actors to extend its usage to the embedded systems field. This adoption has had a recognized positive effect in the embedded field, but the kernel does not have the timing performance required for real-time applications. Many academic and industrial efforts have been made to enhance the Linux kernel with real-time functionality.
Currently, several implementations of real-time extensions for the Linux kernel exist [4].

2.1 Real-time Linux technology
Academic research and industrial efforts have created several real-time Linux implementations [5], [6]. These extensions fall into two categories according to the approach used to improve timing performance. The first approach consists of modifying the kernel behavior to improve its real-time characteristics, by reducing the delays experienced by high-priority tasks.

Fig. 1: Microkernel based real-time Linux

The second approach consists of using a small real-time kernel that handles the real-time tasks and runs the Linux kernel as a low-priority task. The idea behind this approach is illustrated by Figure 1. The best-known projects using this technology are RTAI and Xenomai. These two projects are built on ADEOS, which allows the creation of multiple domains. ADEOS is also responsible for interrupt management, as every triggered interrupt is directed to its registered domain. However, if a domain receives an interrupt that it is not aware of, the interrupt is systematically forwarded to the next domain in the ADEOS pipe. Figure 2 shows the interrupt management of ADEOS-based real-time Linux.

Fig. 2: ADEOS based real-time Linux

2.2 Main real-time Linux solutions
There has been noteworthy work to turn Linux into a hard or soft real-time operating system. This work is essentially based on one of the previously presented technologies. In this section, the most popular implementations of these technologies are discussed.

2.2.1 Preemptible Kernel (lowlatency)
This extension was originally developed as an external patch called preempt-kernel by Robert Love [7].
Since the 2.5 kernel version, the preempt-kernel patch has been incorporated into the mainline kernel to offer better reactivity. Thanks to this extension, a process may be scheduled out practically anywhere in the kernel. This project grew out of the changes made to the Linux kernel for SMP (Symmetric Multi-Processor) support. Such support required protecting critical sections from concurrent access by processes running on distinct CPUs. This protection was implemented using spinlocks, which protect kernel areas from concurrent access. These areas are nearly the same as those that must be protected to offer a reentrant kernel.

2.2.2 PREEMPT-RT
The PREEMPT-RT patch is the most successful Linux modification that transforms Linux into a fully preemptible kernel without the help of a microkernel [8]. It allows almost the whole kernel to be preempted, except for a few very small regions of code. This is done by replacing most kernel spinlocks with mutexes that support priority inheritance and are preemptible, as well as by moving all interrupts to kernel threads (dubbed interrupt threading), which, by giving them their own context, allows them to sleep, among other things. This patch introduces new operating system enhancements that reduce both the maximum and the average response time of the Linux kernel. These enhancements are progressively being added to the Linux kernel to offer real-time capabilities. The most important enhancements are:
- High resolution timers
- Complete kernel preemption
- Interrupt management as threads
- Hard and soft IRQs as threads
- Priority inheritance mechanism
Some of these new features, such as threaded IRQs, are currently being pushed to the mainline kernel by the patch maintainers.

2.2.3 RTAI
RTAI (Real-Time Application Interface) [9] is usable on both uniprocessors and symmetric multi-processors (SMPs). This extension allows the usage of Linux in many "hard real-time" applications. As an option, RTAI's "LXRT" allows the control of real-time tasks, using all of RTAI's hard real-time system calls, from within Linux memory-protected user space, resulting in soft real-time combined with fine-grained task scheduling. RTAI is the real-time Linux that has the best integration with other open-source tools such as Scilab/Scicos [10], [11]. This extension is widely used in control applications.

2.2.4 Xenomai
Xenomai [12] is a real-time development framework that can be integrated with the Linux kernel to provide hard real-time support. The current version is based on a dual-kernel approach: it places the ADEOS (I-Pipe) micro-kernel between the hardware and the Linux kernel. I-Pipe is responsible for executing real-time tasks and intercepts interrupts, blocking them from reaching the Linux kernel to prevent the preemption of real-time tasks by the Linux kernel. Figure 3 illustrates the functional behavior of ADEOS/I-Pipe in the case of the Xenomai implementation.
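The way an interrupt travels through prioritized domains can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the ADEOS/I-Pipe API: the `Domain` class, the `dispatch` function and the IRQ numbers are hypothetical, and the model only captures the rule that an interrupt a domain does not handle is passed on to the next domain in the pipe.

```python
class Domain:
    """Hypothetical domain: a name plus the set of IRQ numbers it handles."""

    def __init__(self, name, handled_irqs):
        self.name = name
        self.handled_irqs = set(handled_irqs)


def dispatch(pipeline, irq):
    """Walk the pipeline from the highest-priority domain downwards.

    Returns the name of the first domain that handles the IRQ, or None
    if no domain in the pipeline claims it.
    """
    for domain in pipeline:
        if irq in domain.handled_irqs:
            return domain.name
        # Not handled here: forward the interrupt to the next domain.
    return None


# The real-time domain sits ahead of Linux in the pipe (illustrative IRQs).
pipeline = [Domain("xenomai", {7}), Domain("linux", {1, 7, 14})]

print(dispatch(pipeline, 7))
print(dispatch(pipeline, 14))
```

Because the real-time domain is placed first, it sees IRQ 7 before Linux does, while IRQ 14 falls through to the Linux domain.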
The resulting system is composed of Linux and a small co-kernel running side by side on the same hardware. The Xenomai co-kernel exclusively controls the real-time applications and exposes real-time interfaces either to kernel-space modules or to user-space applications.

Fig. 4: Xenomai skins architectures

These interfaces, called skins, can mimic the pSOS+, VRTX, VxWorks, POSIX, uITRON and RTAI APIs. Because of this feature, Xenomai has been described as an RTOS chameleon. It was designed to enable smooth migration from a traditional RTOS to Linux without having to rewrite the entire application. Figure 4 illustrates the Xenomai skins architecture and shows that almost all skins are equivalent to the native skin. In addition, Xenomai supports a wide range of architectures (PowerPC32 and PowerPC64, Blackfin, ARM, x86, x86_64 and ia64).

3. REAL-TIME PERFORMANCE MEASURING AND BENCHMARKING
A real-time computer system has three performance aspects that must be monitored to reveal the overall system performance [13], [14]: real-time performance, throughput and stability. These can be assessed using real-time measurement programs, benchmarks and stress tools. The results are generally used to measure, analyze and improve both hardware and software architectures by manipulating various factors.

3.1 Real time measurement program
Various measurement programs exist to reflect the health of a real-time operating system. Each one has its own approach and focuses on a specific performance aspect. Table 1 summarizes the most important real-time measurement programs. The most important feature of such systems is determinism.
Other features, such as response time, scheduler robustness, protection from priority inversion and the preemption mechanisms offered, can be considered quality metrics. Each of these programs can evaluate one factor or a set of factors. However, the worst-case execution time and jitter can summarize the overall timing performance. Cyclictest can measure these two metrics by measuring the time between the configured timer expiration and the actual expiry time. For this reason, we adopted Cyclictest for the rest of this work to reflect the real-time performance of our system.

Fig. 3: Interrupt management in Xenomai
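The measurement principle described for Cyclictest, comparing a programmed wake-up time with the time at which the task actually wakes up, can be sketched in Python. This is only an illustration under stated assumptions: the function name `measure_latencies` and its parameters are hypothetical, and the loop runs as an ordinary non-real-time process, so its figures are not comparable to those of the real tool.

```python
import time


def measure_latencies(interval_s=0.001, loops=200):
    """Return wake-up latencies in microseconds for a periodic loop.

    Each iteration sleeps until an absolute deadline and records how late
    the wake-up was relative to that deadline.
    """
    latencies_us = []
    next_wakeup = time.monotonic() + interval_s
    for _ in range(loops):
        delay = next_wakeup - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        now = time.monotonic()
        latencies_us.append((now - next_wakeup) * 1e6)
        # Absolute deadlines keep one late wake-up from shifting the period.
        next_wakeup += interval_s
    return latencies_us


if __name__ == "__main__":
    samples = measure_latencies()
    print(f"avg = {sum(samples) / len(samples):.1f} us, max = {max(samples):.1f} us")
```

Running the same loop on an idle system and then under load gives a feel for how scheduling latency grows with contention, which is the effect the benchmark is meant to expose.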

Table 1. Real-time measurement program
Real-time measurement program: Description
Lpptest: Benchmark included in the PREEMPT-RT patch that measures the latency of interrupts received on the parallel port.
RTMB: Micro-benchmark suite designed to compare many of the common metrics of real-time performance across several platforms and several languages (C, C++ and Java).
RealFeel: ANSI C program that tests how well a periodic interrupt is processed.
Cyclictest: ANSI C program that measures the scheduling latency of the Linux kernel. Cyclictest repeatedly goes to sleep for a given time interval and measures the actual duration of the sleep to infer the latency.
LRTBF: Benchmarking framework composed of a set of drivers and scripts for evaluating the performance of various real-time additions to the Linux kernel.
Hourglass: Synthetic real-time application that can be used to learn how CPU scheduling in a general-purpose operating system works at microsecond and millisecond granularities.
Senoner test: Latency benchmark designed to analyze Linux behavior under high system load.
Bytemark: CPU benchmark suite reporting CPU, cache, memory, integer and floating-point performance.

3.2 Benchmarking programs
Real-time computer systems can be compared by their relative performance. This is done by running a number of standard tests and trials against them. Benchmark results depend essentially on the hardware, but the software execution environment also has a notable impact on the results. The main purpose of benchmarks is to offer a way of comparing the performance of several subsystems across different hardware/software architectures. Each benchmark covers a different set of system performance aspects. For Linux-based computer systems, various community and industrial benchmarks are available for different computing purposes.
Table 2 summarizes the most used open-source Linux benchmarks.

Table 2. Open source Linux Benchmark
Benchmarking program: Description
hackbench: ANSI C benchmark designed to measure the performance, overhead and scalability of the Linux scheduler.
Lmbench: ANSI C microbenchmarks designed to measure latency and bandwidth.
UnixBench: ANSI C benchmark designed to provide a basic indicator of the performance of a Unix-like system. UnixBench can measure various aspects of the system's performance and supports multi-CPU systems.
IOZone: ANSI C filesystem benchmark that generates and measures a variety of file operations.

3.3 Stress programs
Stress programs are generally used to test the stability of a computer system during building and tuning. For the Linux kernel, several stress programs are used to validate every kernel release. Each of these programs, summarized in Table 3, can cover several aspects of the kernel functionality.

Table 3. Open source Linux stress program
Stress program: Description
dohell: Script based on the previously presented hackbench benchmark and the dd command that heavily loads the entire system.
Stress: Simple ANSI C program that can impose a configurable amount of CPU, memory, I/O and disk stress on POSIX-compliant operating systems.
Calibrator: Small ANSI C program designed to extract the cache memory, main memory and TLB parameters.
Cpu Burn: Stress program designed to heavily load CPU chips.

4. REAL-TIME CHARACTERISTICS
4.1 Timing performance evaluation
Measuring the real-time performance of a Linux-based operating system can require investigating various aspects of the studied system. The most important aspects of such a system are WCET (Worst Case Execution Time) and throughput. Table 1 summarizes the most widely adopted benchmarks and test programs. The Cyclictest benchmark can be used with different parameters to record the latency of individual samples or only the average and maximum latency.
In our case, we used the verbose mode to study the latency statistically and the silent mode to determine the average and maximum latency. The results obtained for the studied kernels are plotted in Figures 5 to 9.
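The measurement principle described above for Cyclictest (sleep for a fixed interval, then compare the requested and the actual sleep duration) can be sketched as follows. This is only an illustrative approximation and not Cyclictest itself: the function names `measure_wakeup_latency` and `summarize` are hypothetical, the interval and sample count are arbitrary, and a plain user-space sleep is assumed instead of the real-time scheduling policy and absolute timers a real measurement would use.

```python
import time


def measure_wakeup_latency(interval_s, samples,
                           sleep=time.sleep, clock=time.perf_counter):
    """Return the oversleep, in microseconds, of `samples` timed sleeps."""
    latencies_us = []
    for _ in range(samples):
        start = clock()
        sleep(interval_s)            # ask to be woken after interval_s seconds
        elapsed = clock() - start    # time that actually passed
        latencies_us.append((elapsed - interval_s) * 1e6)
    return latencies_us


def summarize(latencies_us):
    """Return (average, maximum) latency, as in the silent mode."""
    return sum(latencies_us) / len(latencies_us), max(latencies_us)


if __name__ == "__main__":
    # Hypothetical settings: 1 ms interval, 1000 samples.
    avg, worst = summarize(measure_wakeup_latency(0.001, 1000))
    print(f"average = {avg:.1f} us, maximum = {worst:.1f} us")
```

Keeping every sample corresponds to the verbose mode used for the statistical study, while `summarize` keeps only the two figures reported in the silent mode. The `sleep` and `clock` parameters are injectable so that the logic can be checked without depending on real timing.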

Fig. 5: Statistical latency results of the Xenomai patched Linux kernel
Fig. 6: Statistical latency results of the PREEMPT-RT patched Linux kernel
Fig. 7: Statistical latency results of the low latency Linux kernel
Fig. 8: Statistical latency results of the generic Linux kernel
Fig. 9: Statistical latency results of the server Linux kernel

4.2 Interpretation

The results presented above show that the average response time of the five studied kernels is around 10 µs. The best maximum latency, about 15 µs, is obtained with Xenomai. This result can be explained by the architecture of Xenomai, which separates the real-time and Linux domains. On the other hand, the result obtained with PREEMPT-RT can encourage the use of this kernel for hard real-time applications, since its maximum latency is about 62 µs. The main advantage of this solution is its full compatibility with classic Linux applications. The low latency kernel shows better average results than the standard kernel, but its maximum latency can be a serious limitation for its adoption in hard real-time applications. The standard and server kernels are given as reference results for the real-time enhanced kernels.

5. SYSTEM PERFORMANCE

5.1 System performance evaluation and benchmarking

New improvements in computer technology bring miscellaneous requirements and constraints for system performance evaluation, especially with the emergence of multi-core architectures. System performance evaluation can be very helpful in the design flow since it reflects the performance of the whole system, including all its aspects. Several system benchmark suites exist. These benchmarks can be classified as follows:

- CPU benchmarks
- Embedded and media benchmarks
- Language-specific benchmarks
- Transaction processing benchmarks
- Web server benchmarks
- Domain-specific benchmarks

Every benchmark from these categories should be representative of the applications that can run on the studied systems. The categories and their relevant benchmarks are detailed in the book [13]. Nowadays, with the convergence of desktop and embedded systems, general system benchmarks can be used. The most useful benchmarks in our case are LMBench, UnixBench and Nbench. We adopted UnixBench for the rest of our work since this benchmark has been updated to support multiprocessor systems and is highly portable across different UNIX systems.

5.2 UnixBench

UnixBench is designed to extract a basic performance indicator of a UNIX system. Various aspects of the system are reflected using an index that compares the performance of the current system to that of a reference system. The entire set of index values is then combined to make an overall index for the system. UnixBench can also handle multi-CPU systems. The advantage of this benchmark in our study is its capability to reflect the performance of the overall system (including the operating system and the compiler used) and not only that of the available hardware, which is what matters for real systems.
The individual performance reports indicate the performance of the system in a specific domain such as integer or floating-point computation. The system benchmark score indicates the performance of the global system. In both reports, a higher score indicates better performance.

5.3 System performances using single processor and 2 cores

UnixBench can detect the CPUs available on the studied system and parallelize its different benchmarks across them. The reported values are given for both single-core and multi-core configurations. Since we used a dual-core processor, the results illustrated in Figures 10 and 11 are for single-core and dual-core operation. The results are a score attributed to the whole computer system, computed from the results of various internal benchmarks such as Dhrystone and Whetstone.

Fig. 10: System benchmark scores for the different studied kernels running on a single core.
Fig. 11: System benchmark scores for the different studied kernels running on a dual core.

5.4 Interpretations

The values measured for the single-core architecture show a global performance degradation caused by the real-time capabilities of the PREEMPT-RT and low latency kernels, in the order of 16% compared to the standard Linux kernel. The Xenomai patched kernel shows better global performance than the standard Linux kernel. This result can be explained by the deactivation of power management and frequency scaling in the Xenomai patched kernel. On the other hand, the Xenomai results were obtained with an unloaded real-time domain. The dual-core architecture shows a considerable degradation for the PREEMPT-RT patch, in the order of 49% compared to the standard Linux kernel. Moreover, the performance of the dual-core architecture is only 9% higher than that of the single-core architecture for the PREEMPT-RT kernel. A wiser choice can be the adoption of a higher performance single-core processor instead of a dual-core one.
Results of this kind can be explained by the current level of maturity of PREEMPT-RT multi-core support.
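The index mechanism described in Section 5.2 and the relative degradations quoted in Section 5.4 can be illustrated with the short sketch below. It rests on assumptions that are not stated in the paper: each index is taken as the ratio of the measured result to the reference (baseline) result scaled by 10, and the overall index is taken as the geometric mean of the individual indexes, which is our understanding of how UnixBench combines them. The function names are hypothetical.

```python
import math


def index(result, baseline, scale=10.0):
    """Index of one test: measured result relative to the reference system.

    The scale factor of 10 is an assumption about the reference rating.
    """
    return scale * result / baseline


def overall_index(indexes):
    """Combine individual indexes with a geometric mean (assumed rule)."""
    return math.exp(sum(math.log(v) for v in indexes) / len(indexes))


def degradation_percent(score, reference):
    """Relative loss of `score` with respect to `reference`, in percent."""
    return 100.0 * (reference - score) / reference
```

With this convention, a kernel whose overall index is 84 when the standard kernel reaches 100 would show a degradation of 16%, the order of magnitude reported for the single-core case. These two scores are made up for illustration and are not measurements from the paper.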

6. CONCLUSIONS

This paper provided a comparative study of various real-time enhanced Linux kernels. Our results show that Xenomai and PREEMPT-RT have comparable performance on a monoprocessor system, with a slight superiority of Xenomai for both latency and throughput. We concluded that adopting PREEMPT-RT can be the wiser choice for monoprocessor real-time systems, owing to the smooth migration of application development from standard Linux to PREEMPT-RT. On the other hand, the multiprocessor results show a clear degradation for the PREEMPT-RT patch, which can be an obstacle to the adoption of PREEMPT-RT on such systems. The low latency Linux kernel can be a serious candidate for soft real-time applications in a multiprocessor environment since this extension behaves well in such an environment; additionally, it has the same programming model as the standard Linux kernel. As a follow-up to this work, we plan to investigate the capability of the Xenomai/Solo solution, which tries to merge the PREEMPT-RT and Xenomai solutions.

7. REFERENCES

[1] K. Yaghmour, G. Ben-Yossef, and P. Gerum, Building Embedded Linux Systems, O'Reilly.
[2] M. Mossige, P. Sampath, and R. Rao, Evaluation of Linux rt-preempt for embedded industrial devices for Automation and Power technologies: A case study, in Proceedings of the Ninth Real-Time Linux Workshop, Linz, November.
[3] P. Gerum, Xenomai - Implementing a RTOS emulation framework on GNU/Linux, 2004.
[4] M. Tim Jones, Anatomy of real-time Linux architectures: From soft to hard real-time, IBM, 15 Apr 2008.
[5] Z. Chen, X. Luo, Z. Zhang, Research Reform on Embedded Linux's Hard Real-time Capability in Application, International Conference on Embedded Software and Systems Symposia (ICESS Symposia '08), July 2008.
[6] S. Level, Anatomy of real-time Linux architectures: From soft to hard real-time, IBM, 2008.
[7] Arnd C. Heursch, Dirk Grambow, Dirk Roedel and Helmut Rzehak, Time-critical tasks in Linux 2.6: concepts to increase the preemptability of the Linux kernel, Linux Automation Konferenz, University of Hannover, Germany, March.
[8] Dongwook Kang, Woojoong Lee, and Chanik Park, Kernel Thread Scheduling in Real-Time Linux for Wearable Computers, ETRI Journal, Volume 29, Number 3, June.
[9] F. Jiang, S. Gao, Jie Zhang, A Hardware-in-the-loop Simulation System of Diesel Engine Based on Linux RTAI, Asia-Pacific Power and Energy Engineering Conference (APPEEC), March 2009, Page(s): 1-4.
[10] R. Bucher, S. Balemi, Scilab/Scicos and Linux RTAI - a unified approach, Proceedings of 2005 IEEE Conference on Control Applications (CCA), Aug.
[11] G. Doukas and K. Thramboulidis, A Real-Time Linux Based Framework for Model-Driven Engineering in Control and Automation, IEEE Industrial Electronics, Volume PP, 2009, Page(s): 1-11.
[12] Byoung Wook Choi, Dong Gwan Shin, Jeong Ho Park, Soo Yeong Yi, Seet Gerald, Real-time control architecture using Xenomai for intelligent service robots in USN environments, Springer Intel Serv Robotics, 2009.
[13] L. Kurian, J. Lieven Eeckhout, Performance Evaluation and Benchmarking, CRC Press, 2006.
[14] Paul J. Fortier, Howard E. Michel, Computer Systems Performance Evaluation and Prediction, Digital Press.


DSC Performance Evaluation and Exploration: Case of TMS320F28335

Imene MHADHBI #1, Nabil LITAYEM *2, Slim BEN OTHMEN #3, Slim BEN SAOUD #4
# LSA Laboratory, INSAT-EPT, University of Carthage, TUNISIA
* Department of Computer Science, College of Arts & Science, Salman Bin Abdalaziz University, KSA

Abstract

The rapid advance of microelectronics and semiconductor technologies has increased the capacity of digital circuits such as Application-Specific Integrated Circuits (ASICs), microcontrollers (MCUs), Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). Recently, a new class of digital circuits termed "Digital Signal Controllers" (DSCs) has emerged. A DSC combines the processing power of a DSP and the functionality of an MCU with several peripheral modules, which makes it an attractive proposition for practically all embedded system applications, including communication, audio, medical, aerospace, defence and industrial control. The performance analysis of DSC processors is one of the metrics to consider when choosing the best processing element for a specific application. In this paper, we focus on the performance analysis of the TMS320F28335 DSC, based on benchmarking.

Keywords: Embedded Systems, DSC, Performance, Benchmarking.

I. INTRODUCTION

Embedded systems are becoming one of the important factors in the growth of the e-industries. They are present in practically all human activities, in devices such as cellular telephones, personal digital assistants (PDAs), digital cameras and GPS receivers. Semiconductor markets have responded to this demand with a bewildering array of processing solutions such as ASICs (Application Specific Integrated Circuits), FPGAs, DSPs, DSCs and SoCs (Systems-on-Chip).
Initially, embedded control systems were implemented on microcontrollers (MCUs) because of their small size, efficient input/output communication ports and ability to perform control applications. In the same era, DSPs were used in telecommunication, image processing and signal processing applications. To improve the performance of embedded systems, MCU manufacturers increased the data width from 8 to 16 bits. Similarly, DSP manufacturers began to include more controller features, to the point where the devices could be called DSP controllers (DSCs) [1]. Different studies show that a DSC, an embedded controller with a specific microprocessor designed for the typical mathematical operations used to manipulate measured digital data, is capable of processing data quickly and generating output data in real time. DSC systems can accomplish complex and sophisticated embedded applications that cannot be implemented using other processor technologies. They can be used in different applications such as image processing [2], digital control processing [3, 4, 5], speech synthesis [6] and control implementation [6, 7]. DSCs integrate the algorithm processing power of a DSP engine with the hard real-time control abilities of an MCU [1]. Their additional hardware units speed up the computation of sophisticated mathematical operations, reducing the memory required and the number of execution cycles in the processor. DSCs are compact, include specific hardware and offer good performance, giving a favorable cost/benefit/performance trade-off. Table 1 illustrates the different features of DSCs compared to MCUs and DSPs.
TABLE I MCUS, DSPS AND DSCS PROCESSORS FEATURES [1]
Columns: Features, MCUs, DSPs, DSCs
Features:
Execute From Flash
Large Register Set
Robust Interrupt Capability
Abundant Mixed Signal
Single-Cycle MAC
Dual-operand Fetch
Zero-Overhead Fetch
Saturation/Rounding
Bit-Reverse Modes

The complexity of the algorithms requires the designer to have a clear idea of the hardware computing performance of the DSC processor used. Currently, there are many DSC manufacturers, each with many families. Designers, and especially new users in the DSC field, face various problems in selecting the appropriate processor to implement the most efficient algorithm on the least expensive hardware within a given time. Choosing the correct system can be based on a comparison of the performance of each processor, in order to save energy and money and to minimize the risk of a long time to market. Many interesting studies based on the hardware performance evaluation of processors are presented [8, 9, 10,

11, 12] to allow fair comparisons between processors. Some of them are based on the comparison of area, energy and cost in specific applications. Others use performance evaluation techniques to evaluate or compare processor efficiency. The goal of this paper is to evaluate the performance of the TMS320F28335 DSC processor. The remaining parts of this paper are organized as follows. Section 2 illustrates the performance evaluation techniques and introduces the methodology we used. Section 3 presents the HW platform used. Section 4 determines the performance measurement results of the TMS320F28335. Finally, Section 5 summarizes the paper.

II. PERFORMANCE EVALUATION TECHNIQUE

The performance analysis of embedded systems has multiple aspects depending on the application the system is made for. It remains a true challenge for designers of this kind of system, especially for DSC designers. Performance evaluation helps the designer to answer the following questions: Is a particular DSC platform appropriate for our application? How fast is the processor? Is it suitable for real-time applications? What is the memory usage?

A. Historical Evaluation Technique

In the past, analysis was performed by DSC vendors such as Texas Instruments, Motorola and Analog Devices. The analysis approach depended on the vendor's platform. The first evaluation approach was based on the number of operations per second [9]. Performance evaluation used very simple metrics to describe processor performance, based on MIPS (millions of instructions per second), MOPS (millions of operations per second) and MACS (multiply-accumulates per second). These metrics are misleading because the amount of work performed by an instruction varies. They became insignificant when RISC architectures appeared. Currently, many solutions to measure hardware performance are available. Most of them are based on benchmarking applications.

B. Benchmark Evaluation Approach

The benchmark approach is used by the Standard Performance Evaluation Corporation in the popular SPEC benchmarks. Benchmarking is a widely recognized approach to performance evaluation. Benchmarks are written in a high-level programming language and measure the performance of both the compiler and the processor. About 25 years ago, there was no independent authority for DSP benchmarking; benchmarking was conducted almost exclusively by the chip vendors themselves. Nowadays, several open source benchmarks are used: MiBench [13], the most popular, Paranoia [14], LINPACK [15], etc. There are also more efficient commercial benchmarking solutions such as SPEC (Standard Performance Evaluation Corporation) [16], EEMBC (Embedded Microprocessor Benchmark Consortium) [17], designed for embedded systems, or EDN's DSP benchmark. Since the early days of computer engineering, benchmarking has played a wide variety of important roles, greatly influencing major HW/SW concepts. Recently, benchmarking has attracted considerable attention in both the research and industrial CAD communities. The best benchmark is the application itself. However, in most cases we want a performance estimate of the end product at the initial phase of the project. Benchmarks can be divided into three categories depending on the application they are intended for (i.e. control benchmarks, computation benchmarks and I/O benchmarks). Since it is very difficult to fit existing benchmarks solely into one category, it is a better idea to take a combination of these criteria (such as control-computation benchmarks and I/O benchmarks). Another useful way to categorize benchmarks is by whether they are synthetic or application based. Three benchmark types for microprocessors, MCUs and DSPs are used [18]:

1) Synthetic Benchmarks: developed to measure system-specific parameters. Synthetic benchmarks are created with the intention of measuring one or more features of systems, processors, or compilers.
They try to mimic the instruction mixes of real-world applications. However, they do not indicate how a given feature will perform in a real application.

2) Application Based Benchmarks or "real world" benchmarks: developed to compare different processor architectures in the same fields of application. Application based benchmarks use code drawn from real algorithms and are more common in system-level benchmarking requirements.

3) Algorithm Based Benchmarks: (a compromise between the first and the second type) developed to compare system architectures in special (synthetic) fields of application.

The optimal benchmark program for a specific application is one that is written in a high-level language, portable across different machines, easily measurable and widely distributed.

III. BENCHMARK PROGRAM SELECTION AND SPECIFICATION

In our work we chose to adopt freely available benchmark solutions. First, we used Synthetic Benchmarks based on two complementary benchmarks: Dhrystone, which reports the integer performance of the architecture in Dhrystone MIPS, and Whetstone, which computes different algorithms and reports the characteristics of the floating-point units in Whetstone MIPS. Second, we implemented Algorithm Based Benchmarks that perform mathematical operations with basic fixed-point/floating-point computation.

A. Dhrystone Benchmark

Dhrystone [19] is a synthetic computation benchmark program developed in 1984 by Reinhold P. Weicker in Ada and translated to C by Rick Richardson. It is intended to be representative of integer performance. Dhrystone grew to become representative of general processor performance until it was superseded by the benchmarks of the Standard Performance Evaluation Corporation. The recent version 2.1

of this benchmark consists of 103 high-level statements within the main loop, which executes repeatedly during the benchmark run. The user can choose the number of iterations. As a result, Dhrystone prints the absolute time required per iteration of the loop and the performance measured in Dhrystones per second (the number of iterations of the main code loop per second).

B. Whetstone Benchmark

The Whetstone benchmark [20] is a synthetic benchmark written in 1972 at the National Physical Laboratory in the United Kingdom. It was the first program intentionally written as a benchmark to measure processor performance. It originally measured computing power in units of Kilo-Whetstone Instructions Per Second (KWIPS). This was later changed to Millions of Whetstone Instructions Per Second (MWIPS). Both Dhrystone and Whetstone are synthetic benchmarks, meaning that they are simple programs carefully designed to statistically mimic the processor usage of some set of programs. Their difficulty stems from the fact that one benchmark cannot effectively represent the variety of embedded applications.

C. Algorithm Based Program

The Algorithm Based Program is a set of simple programs executed to evaluate the DSC processor architecture. It can be separated into different programs:

1) The Coordinate Rotation Digital Computer (CORDIC): invented by Jack Volder in 1959 [21, 22], CORDIC is a simple program designed to estimate basic elementary functions such as trigonometric functions, square roots and exponential functions. Designers use CORDIC in practically all applications: biomedical applications to compute Fast Fourier Transforms (FFTs), robotics to determine the position and movement of robot joints and limbs, signal processing to generate sine and cosine waves, image processing to implement lighting and vector rotation, and control applications for asynchronous machines. CORDIC has two modes, the "Rotation" mode and the "Vectoring" mode. In the Rotation mode, the input vector is rotated by a specified angle to compute the sine and cosine, while in the Vectoring mode the program rotates the input vector onto the x axis and records the angle of rotation required, in order to compute $\tan^{-1}(y/x)$.

2) FIR (Finite Impulse Response) filter: filtering has two uses [23], signal separation and signal restoration. In DSCs, digital filters are classified into FIR and IIR (Infinite Impulse Response) filters. In our paper, we chose to benchmark the FIR filter, which requires multiply-and-accumulate (MAC) operations to compute its output from 17 coefficient taps using simulated ADC input data. It is implemented in practically all digital signal and image processing fields, for example in the measurement of the electrical activity of a baby's heart (ECG signal).

In the following sections, we present our platform.

IV. OVERVIEW OF THE DSC HW/SW PLATFORM

A. Hw Platform

Texas Instruments is one of the leading companies producing DSCs. Depending on the application, three DSC families are used [6]: the C2000 family is efficient for real-time control applications, the C5000 family focuses on mobile systems and the C6000 family is used for audio, image processing and communication applications. In our study, we chose to evaluate the performance of the C2000 family, which is suited to real-time control applications. The selected TMS320F28335 DSC is one of the cutting-edge floating-point DSCs in this series. It operates at 150 MHz. Fig. 1 describes the functional block diagram of the TMS320F28335 DSC.

Fig. 1 TMS320F28335 Functional Block Diagram [6]

As mentioned in Table 2, the TMS320F28335 is fitted with a large memory capacity of 512 KB of on-chip flash, 2 KB of OTP ROM and 68 KB of SARAM, which is sufficient

to store the program. It is also equipped with a 12-bit ADC, an RS232 interface, a CAN interface, etc.

TABLE II THE MAIN FEATURES OF TMS320F28335 DSC

Feature: Description
Architecture: 32-bit Harvard bus architecture
Frequency: 150 MHz
Cycle Time: 6.67 ns
Clock and System Control: Dynamic PLL ratio supported, on-chip oscillator, watchdog timer module, three 32-bit CPU timers
Memory-On-Chip: 512 KB Flash, 68 KB SARAM, 2 KB OTP ROM
Peripherals: SPI, I2C, 12-bit ADCs, internal oscillator, McBSP module, PWM module, watchdog, DMA, RS232, UART and eCAN

Fig. 2 Hw implementation of Benchmarks programs on TMS320F28335 DSC

B. SW Platform

There are two methods to implement SW algorithms on the TMS320F28335 DSC [7]:

1) Using HLS (High Level Synthesis Approach): the design is developed with the MATLAB/Simulink platform and the program can be downloaded directly into the DSC.

2) Using the CCS IDE (Code Composer Studio Integrated Development Environment) tool: the CCS IDE offers an excellent framework for building and implementing programs written in the C language and is becoming a standard framework used by many embedded software vendors. It combines the advantages of the Eclipse software with advanced embedded debug capabilities from Texas Instruments, resulting in a feature-rich development environment for embedded designers.

To implement the benchmark programs, we used the CCS IDE tool, which makes it possible to develop and debug embedded applications. It includes a source code editor, compilers for each of Texas Instruments' device families, a project build environment, a debugger, a profiler, simulators and many other features.

V. PERFORMANCE MEASURES

A. Experimental Setup

The implementation framework for the benchmark programs consists of a TMS320F28335 DSC, a logic analyzer and a computer acting as the host. The hardware block diagram of the benchmark implementation is shown in Fig. 2. The benchmark code is written in CCS.
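As an illustration of the kind of code these algorithm based benchmarks exercise, the two CORDIC modes described in Section III.C can be sketched as follows. This is a minimal floating-point sketch written for readability, not the benchmark code that was run on the DSC: the function names, the choice of 24 iterations and the restriction to angles within ±π/2 (rotation mode) and to x > 0 (vectoring mode) are our own assumptions.

```python
import math

ITERATIONS = 24  # assumed iteration count; more iterations give more precision
# arctan(2^-i) lookup table; an embedded version would store it as constants.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Scale factor that compensates the gain of the successive micro-rotations.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_rotation(angle):
    """Rotation mode: return (sin, cos) of `angle`, for |angle| <= pi/2."""
    x, y, z = GAIN, 0.0, angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0      # rotate toward a zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x


def cordic_vectoring(y, x):
    """Vectoring mode: return atan(y / x), assuming x > 0."""
    z = 0.0
    for i in range(ITERATIONS):
        d = -1.0 if y >= 0.0 else 1.0      # rotate the vector toward the x axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return z
```

In a fixed-point implementation the multiplications by `2.0 ** -i` become right shifts, so each iteration needs only shift and add operations.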
The code is then compiled, linked, downloaded and executed on the processor. Once the executable code has been downloaded, it runs entirely on the DSC. Processor performance can be measured in many ways. The most common metric is the time required for a processor to accomplish a defined task. Some architectures use an internal CPU clock divider. In that case, the total execution time of the code is the clock divider multiplied by the total instruction cycle count; this clock divider is not reflected in the total instruction cycle counts presented. In our case, time is measured using an internal timer, which operates at 150 MHz and triggers an interrupt every 1 µs, and a logic analyzer, which measures values in the order of nanoseconds, to obtain a high-precision measurement. In the next section, we present the results of the benchmark performance analysis.

VI. BENCHMARKS RESULTS ANALYSIS

Clearly, each benchmark can only be compared to itself, as the resulting values are meaningless outside of that benchmark's context. The execution time of each benchmark is very small, so a number of loops were used to obtain a time in the microsecond range.

A. Dhrystone Benchmark results

The Dhrystone benchmark is used to measure the performance of processors in handling pointers, structures and strings. It is dominated by simple integer arithmetic, string operations, logic decisions, and memory accesses intended to reflect the processor's activities in most general-purpose computing applications. The results of the Dhrystone benchmark are based on its run time: the number of microseconds that the Dhrystone program takes to run. To evaluate the performance of the TMS320F28335 DSC, we chose a Loop equal to 100. Dhrystone MIPS (DMIPS) is calculated using the following formulas:

$$\mathrm{DMIPS} = \frac{\mathrm{Loop} / \mathrm{Run\_Time}}{1757}$$

Where:
Run_Time: the time spent running the Dhrystone benchmark, expressed in seconds so that Loop / Run_Time is a number of Dhrystones per second
Loop: the number of loops used
1757: the number of Dhrystones per second obtained on the VAX 11/780 (Virtual Address eXtension), nominally a 1 MIPS machine.

It is interesting to compute the Dhrystone score as a function of the DSC frequency, to show the effectiveness of the DSC core rather than how fast it can run. DMIPS/MHz is computed using the following formula:

$$\mathrm{DMIPS/MHz} = \frac{\mathrm{DMIPS}}{\text{frequency of the DSC in MHz}}$$

Table 3 reports the performance resulting from the execution of the Dhrystone benchmark using 8-bit and 16-bit precision on the TMS320F28335 DSC platform.

TABLE III RESULTS OBTAINED USING DHRYSTONE BENCHMARK
Columns: 8 Bits Precision, 16 Bits Precision
Rows: Run_Time (µs), DMIPS, DMIPS/MHz

The use of 8-bit or 16-bit precision has no effect on the performance (MIPS) results since the Dhrystone benchmark does not use large values. It is dominated by simple integer arithmetic, string operations, logic decisions, and memory accesses intended to reflect the CPU activities in computing applications.

B. Whetstone Benchmark results

The Whetstone benchmark attempts to measure the performance of both fixed-point and floating-point arithmetic in a variety of scientific functions. These functions are divided into modules:

M1: Computation with simple identifiers.
M2: Computation with array elements.
M3: Passing an array as a parameter.
M4: Performing conditional jumps.
M5: Performing integer arithmetic.
M6: Computation of trigonometric functions.
M7: Procedure calls.
M8: Array references and procedure calls.
M9: Integer arithmetic.
M10: Computation of standard functions.

To evaluate the speed (Run_Time) of the Whetstone benchmark in microseconds, we use, for one iteration, a Loop equal to 10.
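The DMIPS and DMIPS/MHz formulas given for the Dhrystone results can be written as a short sketch. The function names are hypothetical, and the run time is assumed to be supplied in microseconds (as in Table 3) and converted to seconds before the ratio is taken; no measured value from the paper is used here.

```python
VAX_DHRYSTONES_PER_SECOND = 1757.0  # VAX 11/780 reference score (1 MIPS)


def dmips(loops, run_time_us):
    """Dhrystone MIPS from a loop count and a run time in microseconds."""
    dhrystones_per_second = loops / (run_time_us * 1e-6)
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dmips_value, frequency_mhz):
    """Normalize a DMIPS score by the core clock frequency in MHz."""
    return dmips_value / frequency_mhz
```

For example, `dmips_per_mhz(dmips(100, run_time_us), 150.0)` would give the normalized score for a Loop of 100 on a 150 MHz core, where `run_time_us` stands for the measured run time.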
The results obtained with this benchmark are summarized in Fig. 3, which presents the Run_Time (µs) of each Whetstone module using the two kinds of floating-point precision: single precision using float (32-bit numbers) and double precision using double (64-bit numbers).

Fig. 3 Results obtained using the two data precisions for the Whetstone benchmark modules (series: Single Precision (µs), Double Precision (µs))

These results show that the integer modules (1, 5, 9) and the floating-point modules (2, 6) execute quickly compared to the procedure call modules, which require more than 50% of the whole execution time. The number of Whetstone Instructions Per Second (WIPS) can be measured for all Whetstone benchmarks. It is calculated as follows:

$$\mathrm{WIPS} = \frac{\mathrm{Loop}}{\mathrm{Run\_Time}}$$

Where:
Loop: the number of loops used
Run_Time: the time spent running the benchmark

Table 4 presents the performance of the TMS320F28335 DSC in Kilo-Whetstone Instructions Per Second (KWIPS).

TABLE IV RESULTS OBTAINED USING THE TWO DATA PRECISION FOR WHETSTONE BENCHMARK
Columns: Single Precision KWIPS, Double Precision KWIPS
Rows: Whetstone calculated, Whetstone measured

C. Algorithms Application Benchmarks results

Algorithm based benchmarks use simple programs. Each program is a unique code for testing specific parts of the architecture. FIR filter and CORDIC programs are widely

used in the DSC field. Performance analysis using these algorithms is based on the time required to complete each defined task over a fixed number of loops. Table 5 gives the execution results of these application benchmarks.

TABLE V RESULTS OBTAINED USING THE TWO DATA PRECISION FOR ALGORITHMS APPLICATIONS BENCHMARK
Columns: Single Precision Run_Time (µs), Double Precision Run_Time (µs)
Rows: CORDIC "Rotation" Mode; Transcendental sine and cosine functions; CORDIC "Vectoring" Mode; Transcendental tan^-1 function; FIR filter

The CORDIC algorithm is not very fast compared to the transcendental mathematical functions. It is adopted because of its very simple implementation based on simple shift-add operations. On this platform, trigonometric functions should therefore be computed using the transcendental mathematical functions. The results of the Whetstone benchmark and of the application based benchmarks, CORDIC and FIR filter, indicate that the minimum execution time is obtained using double precision. Double precision is actually faster than single precision on processors optimized for high-speed mathematical calculations (DSCs, DSPs) when using transcendental mathematical functions, which return double values. These results show the efficiency of DSC processors in signal processing applications.

VII. CONCLUSIONS

The reported benchmark results cover three complementary benchmarks using single and double data precision. Dhrystone is used to compute integer unit performance. Whetstone is able to characterize floating-point operations. The algorithm application benchmarks are used to measure the computing capacity of the processor. The same approach can be used to analyze other embedded systems or other architectures. In our work, we focused on the evaluation of hardware execution speed in the embedded system design flow. We measured the CPU performance of the TMS320F28335 DSC in terms of execution time without using compiler optimization.
It would be very interesting to compute the effect of the compiler's optimization techniques on execution time. This work can be extended by evaluating the performance of the TMS320F28335 DSC in comparison with other architectures such as FPGAs.

REFERENCES

[1] S. Mitra, "When MCUs and DSPs Collide: Digital Signal Controllers", Microchip.
[2] K. Illgner, H. GrubeF, P. Gelaberf, J. Liang, Y. Yoo, W. Rabadiz and R. Talluri, "Programmable DSP Platform for Digital Still Cameras", Acoustics, Speech, and Signal Processing, Proceedings, IEEE International Conference, vol. 4, Mar.
[3] C. Buccella, H. C. Cecati, and H. Katafat, "Digital Control of Power Converters - A Survey", IEEE Transactions on Industrial Informatics, vol. 8, Aug.
[4] S. A. Mir, M. E. Elbuluk, and D. S. Zinger, "Fuzzy Implementation of Direct Self Control of Induction Machines", IEEE Transactions on Industry Applications, vol. 30, 1994.
[5] P. Dobra, R. Duma, D. Moga, and M. Trusca, "Digital Control Applications using TI Digital Signal Controller", WSEAS Transactions on Systems and Control, Issue 6, vol. 3, June 2008.
[6] Texas Instruments Datasheets and Tutorial, 2013.
[7] C2000 MCU Teaching ROM, "TMS320F2833X Digital Signal Processor Implementation Tutorial", Texas Instruments, 2010.
[8] Y. L. Goh, A. K. Ramasamy, F. H. Nagi, A. Azwin, and Z. Abidin, "Evaluation of DSP based Numerical Relay for Overcurrent Protection", International Journal of Systems Applications, Engineering & Development, Issue 3, vol. 5, 2011.
[9] N. Litayem, B. Jaafer and S. B. Saoud, "Embedded Microprocessor Evaluation - A case study of the Leon3 Processor", Tecnia Journal of Manajment Studies, vol. 6, 2011.
[10] Groâschädl J., Tillich S., Szekely, "Performance Evaluation of Instruction Set extensions for Long Integer Modular Arithmetic on a SPARC V8 Processor", IEEE DSD.
[11] Kurian L., Lieven Eeckhout, "Performance evaluation and Benchmarking", CRC Press.
[12] Wolf W., "High-performance embedded computing", Elsevier, 2007.
[13] M. R. Guthaus, J. S. Ringenberg and D. Ernst, "MiBench: A free, commercially representative embedded benchmark suite", in Proc. of 4th Annual IEEE Workshop on Workload Characterization, 2001.
[14] R. Karpinski, "Paranoia: A floating-point benchmark", Byte Magazine 10, 1985.
[15] J.J. Dongarra, J.R. Bunch, C.B. Moler and G.W. Stewart, "LINPACK Users Guide", SIAM Pub, Philadelphia, PA.
[16] Henning J. L., "SPEC CPU2000: measuring CPU performance in the new millennium", IEEE Computer, vol. 7, pp. 28-35, July.
[17] EDN Embedded Microprocessor Benchmark Consortium.
[18] Berkeley Design Technology Inc., "Evaluating DSP Processor Performance".
[19] G. Stitt, R. Lysecky and F. Vahid, "Dynamic Hardware/Software Partitioning: A First Approach", in Proc. of the 40th Design Automation Conference, 2003.
[20] H. J. Curnow and B. A. Wichmann, "Whetstone benchmark: A synthetic benchmark", The Computer Journal, vol. 19.
[21] Manoj Arora, R. S. Chauhan, Lalit Bagga, "FPGA Prototyping of Hardware Implementation of CORDIC Algorithm", International Journal of Scientific & Engineering Research, vol. 3, pp. 1-6, January.
[22] T. Menakadevi and M. Madheswaran, "Direct Digital Synthesizer using Pipelined CORDIC Algorithm for Software Defined Radio", International Journal of Science and Technology, vol. 2, June 2012.
[23] Application Report Texas Instruments, "MSP430 Competitive Benchmarking", January 2009.

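As an illustration of the shift-add structure of CORDIC mentioned in the benchmark discussion above, the following is a minimal sketch of the "Rotation" mode in Python. It is not the benchmark code used in the paper: the Q16 fixed-point format, the 16 iterations and all names (FRAC_BITS, N_ITER, cordic_sin_cos) are assumptions chosen for illustration, and the input angle is assumed to lie in [-pi/2, pi/2].

```python
import math

FRAC_BITS = 16            # assumed Q16 fixed-point format (illustrative)
ONE = 1 << FRAC_BITS
N_ITER = 16               # assumed number of CORDIC iterations

# Pre-computed table of atan(2^-i) in fixed point.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(N_ITER)]

# CORDIC gain compensation K = prod(1 / sqrt(1 + 2^-2i)).
_gain = 1.0
for _i in range(N_ITER):
    _gain /= math.sqrt(1.0 + 2.0 ** (-2 * _i))
K_FIXED = round(_gain * ONE)


def cordic_sin_cos(angle_rad):
    """Return (sin, cos) of angle_rad using only shifts and adds in the loop.

    Valid for angles in [-pi/2, pi/2] (no quadrant folding in this sketch).
    """
    x, y = K_FIXED, 0                 # start on the x axis, pre-scaled by K
    z = round(angle_rad * ONE)        # residual angle still to rotate
    for i in range(N_ITER):
        if z >= 0:                    # rotate counter-clockwise
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:                         # rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y / ONE, x / ONE


if __name__ == "__main__":
    s, c = cordic_sin_cos(math.pi / 6)
    print(f"sin ~ {s:.5f}, cos ~ {c:.5f}")
```

Each iteration needs two shifts, three additions and one table lookup, which is why the algorithm is simple to implement; it still takes one iteration per bit of precision, which is consistent with the paper's remark that it is not fast compared to library transcendental functions.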
Designing and building embedded environment for robotic control application
Nabil LITAYEM, Meftah GHRISSI, Ahmed Karim BEN SALEM, Slim BEN SAOUD
LECAP-EPT-INSAT, BP 676, 1080 Tunis Cedex, Tunisia

Abstract- An embedded robotics controller is one of the most complex embedded systems. The specific hardware components required by each controller and the real-time constraints increase the complexity of designing this kind of system. The availability of FPGA circuits and their associated hardware libraries, combined with embedded Linux, can improve the time to market, the stability and the power consumption of such a controller. This paper describes the design and building of an embedded controller for robotic applications based on the Xilinx XUP board, using MontaVista Real-Time Linux and the OROCOS library. We present the motivations of this work, the different phases needed to complete it, and the justification of each hardware or software choice. Finally, we discuss the strengths and weaknesses of the proposed solution.

Index Terms- Real time systems, Mobile robots, Embedded Linux, FPGA, control application.

I. INTRODUCTION
Embedded robotic controllers are by nature complex systems designed for very specific applications. Designing this kind of system can require very specific skills related to digital design and real-time systems. Design reuse in this kind of system is an especially hard task due to the specificity of every robotic solution. The emergence of FPGAs (Field Programmable Gate Arrays) and their related tools improves re-usability and portability. Combined with a modern standard operating system, an FPGA can be a very interesting solution which offers the flexibility of the FPGA and the standard interfaces of the operating system [1]. The scope of this work is to demonstrate the feasibility, strengths and weaknesses of this kind of solution.

II. PROPOSED DESIGN FOR ROBOT CONTROLLER
A. Motivation of this work
The design of our robot controller was initiated by the SRD2I company, which developed the Hercule robot [2] used for mineral water and soft drinks transportation. The main limitation of the old design is the poor portability of the control system, since the whole control system is built around a Freescale MPC555 microcontroller and all the software components are home-made. In this work we redesign the control system with two goals. The first one is to be independent of any hardware or software manufacturer without losing any functionality of the old robot versions. The second one is to increase the evolvability and quality of the control system by using a powerful standard robotic library instead of home-made functions.
B. Proposed design
In order to achieve these goals we propose the design illustrated by Fig.1. The use of an FPGA as the HW platform instead of the MPC555 microcontroller can increase the evolvability of our solution, since we can customize any hardware component of our design. Due to the standardization of Linux in both desktop and embedded environments, its use can improve the portability and the ease of development of our solution, since we can entirely prototype our embedded software in a desktop environment. This choice was also motivated by many successful uses of embedded Linux in robotic applications [3][4]. Finally, standard robotic libraries offer a set of predefined functionality which decreases the time to market and increases the quality of the embedded software.
Fig. 1. Embedded robot controller (layers, top to bottom: Robotic Control Application; Embedded Robotic Library; Real Time Linux; HW: µP, QEI, PID Controller, DDR Controller)
C. Presentation of the XUP-V2P Board
We have chosen the XUP Virtex-II Pro development system, shown in Fig.2, as our FPGA platform.
This board [5] provides an advanced hardware platform that consists of a high-performance Virtex-II Pro Platform FPGA surrounded by a comprehensive collection of peripheral components that can be used to create a complex system and to demonstrate the capability of the Virtex-II Pro FPGA.
/09/$ IEEE 2907

Fig.2. Hardware details of the XUP V2P board [4]

III. REAL TIME EMBEDDED LINUX
A. Why Linux?
As reported by the LinuxDevices reference site [6], [7] and illustrated by Fig.3, embedded Linux is the predominant OS in the embedded market and its share grows year after year.
Fig.3. Embedded OS sourcing trends
B. Real Time Linux Implementations
Due to the increasing popularity of Linux in the embedded systems field, many efforts have been made to transform the Linux kernel into a real-time kernel. These works resulted in several implementations of real-time Linux. Currently there are many existing implementations of real-time extensions for the Linux kernel [8], [9]. These extensions can be classified into two categories according to the approach used to improve the real-time performance of the Linux kernel. The first approach consists of modifying the kernel behavior to improve its real-time characteristics. The second approach consists of using a small real-time kernel that handles the real-time tasks and runs the Linux kernel as a low-priority task. In the following section we present the different open source real-time extensions and the technique each one uses to bring real-time behavior to the Linux kernel.
1. ADEOS
The ADEOS [10] project has created a GPL hardware abstraction layer that allows a real-time kernel and a general purpose kernel to co-exist. It supports the kinds of dual-kernel hard real-time Linux environments that previously used RTLinux or RTAI, but without making use of the technology that is the subject of a patent held by the originator of RTLinux.
2. ART Linux
ART-Linux [11] (Advanced Real-Time Linux) is a hard real-time kernel developed with robotics applications in mind. Real-time is accessible from user level and does not require special device drivers.
3. KURT
The KU Real-Time Linux [12] is a real-time Linux extension developed by the University of Kansas. It allows scheduling of events with a 10 µs resolution.
4. Linux/RK
Linux/RK (Linux/Resource Kernel) [13] is an implementation which incorporates real-time extensions into the Linux kernel.
5. QLinux
A Linux kernel implementation that provides Quality of Service (QoS) [14] guarantees for "soft real-time" Linux performance in applications such as multimedia, data collection, etc.
6. RED-Linux
A real-time version [15] of Linux that implements short kernel blocking time, quick task response time, a modularized and runtime-replaceable CPU scheduler, and a general scheduling framework.
7. RTAI
RTAI (Real Time Application Interface) [16] is usable both for uni-processors and symmetric multi-processors (SMPs), and allows the use of Linux in many "hard real-time" applications. As an option, RTAI's "LXRT" allows the control of real-time tasks, using all of RTAI's hard real-time system calls, from within Linux memory-protected user space, resulting in soft real-time combined with fine-grained task scheduling. RTAI is the real-time Linux that has the best integration with other open source tools such as Scilab/Scicos [17]; this extension is widely used in control applications [18].
8. Xenomai
Xenomai [19] is a real-time development framework that provides pervasive, interface-agnostic, hard real-time support to user-space applications, integrated into GNU/Linux. Based on ADEOS, Xenomai was launched by Philippe Gerum, the implementer of ADEOS. Xenomai provides real-time interfaces either to kernel-space modules or to user-space applications. Interfaces include RTOS interfaces (pSOS+, VRTX, VxWorks, and RTAI), standardized interfaces (POSIX, uITRON), or new interfaces designed with the help of RTAI (native interface). It was designed to enable

smooth migration from a traditional RTOS to Linux without having to rewrite the entire application.
9. RT-Preempt
The RT-Preempt patch [20] converts Linux into a fully preemptible kernel. It allows nearly the entire kernel to be preempted, with the exception of a few very small regions of code. This is done by replacing most kernel spinlocks with mutexes that support priority inheritance and are preemptible, and by moving all interrupts to kernel threads (dubbed interrupt threading), which gives them their own context and allows them to sleep, among other things.
10. MontaVista open source kernel
Offered by MontaVista, this kernel includes all the benefits of the MontaVista commercial kernel (real-time capability, driver support, etc.). This open source version is hosted by MontaVista [21], but the company does not warrant or support any software on its public site.
C. Choice of real time extension for robotic controller
The choice among the existing real-time solutions must take into account the offered functionality [22], the existence of Xilinx device drivers in the chosen kernel, and the quality of support for this kernel. Based on these considerations and on related works [23], [24], we have four serious candidates: RTAI, Xenomai, RT-Preempt and MontaVista Linux. Xenomai is the most powerful in terms of functionality and extensibility, but it lacks some Xilinx drivers since it is based on the vanilla kernel; we tried to port the Xenomai patch to the Linux kernel offered by Xilinx, but without success. RTAI was rejected for the same reason as Xenomai, since it is also based on the vanilla kernel. RT-Preempt can be used with the most recent kernels, but it cannot offer hard real-time and it is not officially supported by Xilinx tools; this extension is included in the 2.6 Linux kernel hosted in the Xilinx Git repository.
The MontaVista real-time Linux is based on the 2.4 kernel and is officially supported by Xilinx tools; MontaVista Linux also offers most of the standard Xilinx device drivers. We chose this kernel as the RTOS of our robotic controller. This choice was made without any quantitative evaluation.

IV. ROBOTIC LIBRARY
A. Robotics software platforms
A robotic software platform [25] is a software package which simplifies programming of several kinds of robotic devices by providing: a unified programming environment; a unified service execution environment; a set of reusable components; a debugging/simulation environment; a package of drivers for the most widespread robotics hardware; a package of common facilities such as computer vision, navigation or robotic arm control. There are several commercial and open source robotics software platforms. The most used are: Microsoft Robotics Studio, Mobile Robots, Skilligent, iRobot AWARE, Gostai Urbi, Evolution Robotics ERSP, OROCOS [26] and Player/Stage/Gazebo [27]. According to our initial design, our robot must run under the Linux operating system; for this reason we exclude Microsoft Robotics Studio from the libraries selected for study. Table I shows the main functionality of each robotics platform able to run under Linux.

TABLE I
MAIN PROPERTIES OF ROBOTICS SOFTWARE PLATFORMS [25]
Property | Mobile Robots | Skilligent | iRobot AWARE 2.0 | Gostai Urbi | Evolution Robotics ERSP 3.1 | OROCOS | Player, Stage, Gazebo
Open Source | No | No | No | Partial | No | Yes | Yes
Free of Charge | No | No | No | | No | Yes | Yes
Windows | Yes | Yes | ??? | Yes | Yes | No | Yes (simul only)
Linux | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Distributed Services Architecture | No | Yes | Yes | Yes | No | No | Yes (limited)
Fault-Tolerance | No | Yes | Yes | No | No | No | No
JAUS Compliant | No | Yes | Yes | No | No | No | No (???)
Graphical OCU | Yes | Yes | Yes | Yes | Yes | No | No
Graphical Drag-n-Drop IDE | No | No | No | Yes | Yes | No | No
Built-in Robotic Arm Control | Yes | Yes | Yes | No | No | Yes | No
Built-in Visual Object Recognition | No | Yes | No | No | Yes | No | No
Built-in Localization System | Yes | Yes | No (pieces only) | No | Yes | No | No
Robot Learning and Social Interaction | No | Yes | No | No | No | No | No
Simulation Environment | Yes | No | Yes | Yes (Webots) | No | No | Yes
Reusable Service Building Blocks | Yes | Yes | Yes | Yes | Yes | Yes | No
Real-Time | No | No (Planned) | No | No | No | Yes | No

B. Choosing the robotics software platform for our robotic controller
To choose one of the software platforms presented earlier, we must take into account the characteristics of our robot controller. An open source license and a free-of-charge platform are a good advantage for our robot in terms of price and software

quality. Reusable service building blocks and real-time capability are the most important features needed by our strategy. Fault tolerance and a simulation environment are also important selection criteria, but they are outside our current needs. After studying the software platforms presented earlier, we conclude that Skilligent is the most powerful. But thanks to its open source license and real-time capability, OROCOS nearly meets our needs.

V. TOOLCHAIN COMPILER FOR PPC_405
To build the chosen kernel, the robotic library and the control application, we need a powerful compiler toolchain; this is a common need during embedded project development. This tool-set is known as a cross development toolchain. To achieve this goal, we studied and tried different available commercial/free pre-compiled cross compilers and different tools to build a cross compiler toolchain from scratch. The toolchain is made up of several packages: the kernel headers; the binutils package, which contains the assembler, the linker and binary handling tools; the glibc package, containing among other things the standard C library used by all programs; and the gcc (GNU Compiler Collection) package, at least to provide the C compiler. Creating a cross development toolchain from sources can be a real pain, as these components have cross-dependencies (best known as the "chicken and egg" problem). There are version dependency issues, patches required to make something work, etc. Fortunately, there are several scripts to create a toolchain, as well as directly downloadable binary toolchains.
A. Main cross compiler toolchain
1. DENX ELDK
The DENX [28] Embedded Linux Development Kit (ELDK) provides a complete and powerful software development environment for embedded and real-time systems. It is available for ARM, PowerPC and MIPS. All components of the ELDK are available for free with complete source code under GPL and other Free Software Licenses.
Also, detailed instructions to rebuild all the tools and packages from scratch are included.
2. Buildroot
Buildroot [29] is a complete build system based on the Linux kernel configuration system and supports a wide range of target architectures. It generates root file system images ready to be written to flash. In addition to having a huge number of packages which can be compiled into the image, it also generates a cross toolchain to build those packages from source. Even if you don't want to use Buildroot for your root file system, it is a useful tool for generating a toolchain. It should be noted, however, that it only supports uClibc. If you want to use glibc, you'll need something else.
3. Scratchbox
Scratchbox [30] provides toolchains for ARM and x86 target architectures (with PowerPC, MIPS and CRIS in experimental stages). Both uClibc and glibc are supported. Scratchbox simplifies cross compiling software that is built using GNU autotools: code tests performed by configure are run in an emulator or even on the actual target. The toolchains Scratchbox ships with are based on gcc 3.3 and as such are quite old, but stable and well tested. It should be pointed out that scripts to build custom toolchains are also provided with Scratchbox, allowing more recent gcc versions to be used.
4. Crossdev
Crossdev [31] is specific to developers using Gentoo for their development PCs. It is a script which generates a cross toolchain using the portage build scripts for gcc, etc. Numerous architectures are supported, and both uClibc and glibc toolchains can be built.
5. Crosstool
Crosstool [32] is a script which downloads source tarballs and builds simple gcc/glibc cross toolchains. There is a build matrix which shows which versions of gcc/glibc work together with various architectures. This matrix makes it easy to select which versions of gcc/glibc should be used to generate a toolchain for a particular architecture.
6. Crosstool-NG
Crosstool-NG [33] is a fork of crosstool, targeted at easier configuration, re-factored code and a learning base on how toolchains are built, with support for both uClibc and glibc, for debug tools (gdb, strace, dmalloc...) and a wide range of versions for each tool. Different target architectures are supported as well.
7. CodeSourcery
CodeSourcery [34] develops Sourcery G++, an Eclipse-based Integrated Development Environment (IDE) that incorporates the GNU Toolchain (gcc, gdb, etc.) for cross development for numerous target architectures. CodeSourcery provides a "lite" version for ARM, ColdFire, MIPS and Power architectures. The toolchains are always very up-to-date. CodeSourcery continually contributes the enhancements it makes to the GNU Toolchain upstream.
8. Embedded Debian cross-tools packages
Embedded Debian cross-tools are the toolchains of the Emdebian project. They consist of several prebuilt toolchains targeting arm, ia64, m68k, mips, mipsel, powerpc and sparc, using gcc-3.3, gcc-3.4, gcc-4.0, gcc-4.1 and gcc-4.2.
B. Selecting cross compiler for building our kernel, library and application
To choose a cross compiler toolchain, we must take into account how complex it is to use or build, its compatibility with our working environment, its ability to

build our chosen embedded software infrastructure and its eventual cost. At the start of development, we chose ELDK thanks to its rich feature set and its support of the PowerPC architecture. We also chose ELDK because it comes as an ISO disk image including all the prebuilt tools needed to start development. When we tried to build MontaVista Linux using ELDK 4.1, which includes gcc 4.1, we received many error messages. After studying these errors, we concluded, thanks to the reports of many other users, that they are due to an incompatibility between gcc 4.1 and the 2.4 Linux kernel. This problem was solved by using an older version of ELDK which includes gcc 3.4, or by using crosstool to generate a gcc 3.4 or 3.3 based toolchain. Our experience with Buildroot shows that this tool can successfully build gcc 4.1 for PPC_405 but cannot build older versions; crosstool-NG does not have good support for PPC_405. We must also note that both crosstool and the old ELDK can correctly build MontaVista under SUSE Linux 10 or Slackware 10, but not under a recent Ubuntu distribution, since the cross compiler has some problems with the recent tools included in it.

VI. BUILDING HARDWARE INFRASTRUCTURE
A. Hardware Infrastructure
To meet the requirements of our robot controller we propose the design illustrated by Fig.4. This design is based on two PPC_405 processors associated in shared memory mode and on standard components available in the EDK library, combined with a quadrature encoder interface (CIN) and a hardware PID module from the robot control library available on the OpenCores site. The CIN is used to measure the speed of each wheel of the robot; the PID hardware module can be used to accelerate the PID computation.
Fig.4. Hardware design of robot controller (blocks: PPC_405 0, PPC_405 1, PLB Bus, PLB2OPB Bridge, PLB BRAM Controller, PLB BRAM, PLB DDR Controller, DDR 256 MB, OPB Bus, UART, CIN, PID)
B. Building the hardware infrastructure
EDK is a very interesting IDE for building both hardware and software on Xilinx FPGA platforms. It was used to generate our hardware platform. This generation can be considered in two steps. Firstly, the base system generation, i.e. the core system without any custom IP; this step was conducted using the BSB (Base System Builder) wizard of EDK, and its result is a hardware configuration composed of two PPC_405 processors clocked at 300 MHz, OPB_UART16550, OPB_ETHERNET, PLB_DDR and a 128 KB BRAM block. Secondly, we must integrate the external IPs to add the CIN and PID cores. The generation of the MontaVista Linux BSP is natively supported by EDK; we just have to configure the Software Platform settings with the required information. We must also note that, from version 10.1 Service Pack 3 onward, EDK no longer supports the XUP V2P board, since this board is considered a mature product.

VII. BUILDING LINUX KERNEL AND GENERATING ACE FILE
A. Building the Linux kernel
To build our Linux kernel we used the cross compiler generated with crosstool under SUSE Linux 10. In the configuration phase we must correctly select the devices used in the hardware configuration. The compiled kernel can be used later to generate the ACE file, which loads both the hardware and the software configuration into the FPGA circuit.
B. Generating the ACE file
SystemACE [35] provides many benefits for the hardware and/or software designer. The primary input to SystemACE is an ACE file. It contains a hardware BIT file and/or a software ELF file. SystemACE has a built-in Boundary Scan (JTAG) interface for external tools testing and programming. Using the JTAG interface, SystemACE can configure the FPGA and, using the embedded processor, load internal or external memory with the software ELF file. After building both the hardware (system.bit) and the software (zimage.elf), we must merge them into a single ACE file.
For this purpose, we must customize the genace.tcl file included in EDK to take into account the characteristics of the XUP board.
C. The rest of the job
To get a ready-to-run Linux on the FPGA, we must partition our CompactFlash card into at least 3 partitions: one with a FAT16 file system for the ACE file, one swap partition, and the last one for the root file system. For the root file system, we chose the one from the Virtex-Linux [36] distribution and customized it with ELDK_MAKEDEV and other components from the ELDK distribution.

VIII. DESKTOP DEVELOPMENT ENVIRONMENT
As explained earlier, using Linux as the embedded RTOS combined with a standard open source robotic library offers the opportunity to develop and test most of our control solution in a desktop environment. We have prepared a development environment based on Ubuntu desktop edition, in which we built and tested the OROCOS library and its samples. Running the whole control application in a native desktop environment offers the opportunity to build the entire control application in a user-friendly desktop environment and using

standard tools; this can help reduce the time to market and improve the quality of the control software.

CONCLUSION AND PERSPECTIVES
Using the design and approach presented in this paper, we can build a platform that can be used in robotics control applications. Combining the power of Xilinx FPGAs with the standardization of the Linux kernel and the OROCOS library gives our controller a high level of abstraction and evolvability. It also offers the possibility to develop and test most of our software in a standard Linux desktop environment. The choice of MontaVista Linux 2.4 as the RTOS provides most of the standard Xilinx drivers, but it may be replaced by an up-to-date kernel, since the current kernel is 2.6 and the one we use is 5 years old. We must also note that Xilinx maintains a Git repository of the Linux 2.6 kernel that includes most of the Xilinx drivers and supports most of the Xilinx boards. Our embedded robotic platform can be improved by different approaches that are the subject of our future work. Firstly, we plan to port Xenomai to the platform in order to benefit from this real-time Linux extension, notably RTOS emulation and support of the 2.6 kernel family. This work consists of merging the Xenomai patch with the Linux 2.6 kernel from the Xilinx Git repository. Secondly, we plan to integrate the whole Robot Control library from OpenCores [37]; this library offers many interesting hardware modules that can be used in robotics control applications (QEI, PID, PWM, Stepper Control, etc.).

REFERENCES
[1] Tyson S. Hall, James O. Hamblen, Using an FPGA Processor Core and Embedded Linux for Senior Design Projects, 2007 IEEE International Conference on Microelectronic Systems Education (MSE'07).
[2] S. Ben Saoud, L. Nciri, and M. Ghrissi, Path-Tracking and Parking Manoeuvre Control of an Industrial Tricycle Robot, International Journal of Robotics and Automation
[3] Alok Rao, Satish Kumar, Amit Benu & G. C. Nandi, MILO-Mobile Intelligent Linux Robot, IEEE India Annual Conference 2004, Indicon
[4] Sonia Thakur, James M. Conrad, An Embedded Linux Based Navigation System for an Autonomous Underwater Vehicle, SoutheastCon, Proceedings. IEEE.
[5]
[6] 2008 Embedded Linux Market Survey,
[7] Snapshot of the embedded Linux market, April 2007,
[8] A. Barbalace, A. Luchetta, G. Manduchi, M. Moro, A. Soppelsa, and C. Taliercio, Performance Comparison of VxWorks, Linux, RTAI, and Xenomai in a Hard Real-Time Application, IEEE Transactions On Nuclear Science, Vol. 55, No. 1, February
[9] N. Vun, H.F. Hor and J.W. Chao, Real-time Enhancement for Embedded Linux, IEEE International Conference on Parallel and Distributed Systems.
[10] Karim Yaghmour, Opersys Inc, Adaptive Domain Environment for Operating Systems, [Online]. Available:
[11] [12] [13] [14]
[15] Kwei-Jay Lin, Yu-Chung Wang, RED-Linux Design, Proceedings of IEEE PIEEE 2003, Vol 91, No. 7, pp , July 2003
[16]
[17] Roberto Bucher and Silvano Balemi, Scilab/Scicos and Linux RTAI - A unified approach, Proceedings of the IEEE Conference on Control Applications, Toronto, Canada, August 28-31,
[18] M. Chiandone, S. Cleva, R. Menis and G. Sulligoi, Industrial Motion Control Applications using Linux RTAI, SPEEDAM 2008 International Symposium on Power Electronics, Electrical Drives, Automation and Motion
[19] [20] [21]
[22] M. Tim Jones, Anatomy of real-time Linux architectures: From soft to hard real-time, IBM developerWorks, 15 Apr
[23] Zujue Chen, Xing Luo, Zhixiong Zhang, Research Reform on Embedded Linux's Hard Real-time Capability in Application, The 2008 International Conference on Embedded Software and Systems Symposia (ICESS2008).
[24] Shouyin Lu, Liqiang Feng, Jiwen Dong, Design of control system for substation inspection robot based on real time Linux, 2008 Chinese Control and Decision Conference (CCDC 2008)
[25]
[26] Herman Bruyninckx, Peter Soetens, Bob Koninckx, The Real-Time Motion Control Core of the Orocos Project, Proceedings of the 2003 IEEE International Conference on Robotics & Automation, Taipei, Taiwan, September 14-19, 2003
[27] Koenig, N, Howard, A, Design and use paradigms for Gazebo, an open-source multi-robot simulator, Intelligent Robots and Systems, (IROS 2004). Proceedings IEEE/RSJ.
[28] [29] [30] [31] [32] [33] [34]
[35] ftp://ftp.xilinx.com/pub/documentation/misc/system_ace.pdf
[36] [37]
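As an illustration of the two custom cores described in Section VI (the quadrature encoder interface used to measure wheel speed and the PID module), the following is a minimal software sketch in Python. It is not the OpenCores hardware implementation: the class names, the count direction convention, the encoder resolution and the controller gains are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Count increment for each (previous_state, new_state) pair of the 2-bit
# quadrature signal, with state = (A << 1) | B.
# Index = (previous_state << 2) | new_state.
# The sign convention (which direction counts up) is an assumption.
_QUAD_STEP = [0, +1, -1, 0,
              -1, 0, 0, +1,
              +1, 0, 0, -1,
              0, -1, +1, 0]


class QuadratureCounter:
    """Software model of a quadrature decoder (x4 counting)."""

    def __init__(self):
        self.state = 0
        self.count = 0

    def update(self, a, b):
        new_state = (a << 1) | b
        self.count += _QUAD_STEP[(self.state << 2) | new_state]
        self.state = new_state


def wheel_speed_rps(delta_counts, counts_per_rev, dt):
    """Wheel speed in revolutions per second over one sampling period dt."""
    return delta_counts / counts_per_rev / dt


@dataclass
class PID:
    """Textbook discrete PID controller (gains are illustrative)."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


if __name__ == "__main__":
    qei = QuadratureCounter()
    for a, b in [(0, 1), (1, 1), (1, 0), (0, 0)]:   # one full quadrature cycle
        qei.update(a, b)
    print("counts:", qei.count)
    pid = PID(kp=2.0, ki=0.5, kd=0.0, dt=0.01)
    print("command:", pid.step(setpoint=1.0, measurement=0.5))
```

In the design described by the paper these two functions are candidates for hardware offloading; the sketch only shows the computation that each core performs once per sampling period.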

Servo Drive System Based on Programmable SoC Architecture
Ahmed Karim Ben Salem, Slim Ben Othman, Slim Ben Saoud, Nabil Litayem
LECAP-EPT-INSAT, B.P. 676, 1080 Tunis Cedex, Tunisia

Abstract- Current deep submicron processing technologies enable the integration of multiple software programmable processors and dedicated hardware components into a single integrated circuit, called a System on Chip (SoC), that offers high performance and flexibility. Consequently, to remain competitive, today's motor control industry should develop high speed digital control systems based on these SoCs. In this paper, we tackle this new technology in the context of the electrical motor control domain in two ways: we describe high performance FPGA reconfigurable hardware architectures, and we give a modular and scalable embedded mono-processor software architecture, based on a hard-core processor, ensuring high performance and flexibility. We propose two RTOS supports (µC/OS-II and Xilkernel) for this control application to meet Real-Time (RT) constraints. Our top-down co-design methodology with various abstraction levels has helped us design a high performance embedded controller in a reasonable time. The experimental results show the effectiveness of our RT SoC approach.

I. INTRODUCTION
Nowadays, complex and sophisticated digital motor-control applications exceed the capabilities of conventional solutions. Several solutions based on FPGAs have recently been explored to design high speed digital controllers [1]. But new System on Chip (SoC) devices, including powerful embedded processors linked with a component-rich environment, are envisioned as the future of embedded control platforms. They can meet high-end computational requirements based on a Hardware-Software (HW-SW) environment in order to enhance performance and maximise flexibility.
Besides, the key issue in a SoC design is to trade off efficiency against flexibility. Therefore, there are many challenges regarding SoC design methodologies and architecture styles. The objective of this work is to explore the new features of SoC technology to design and implement an embedded servo drive system with high flexibility. The aim is to obtain a high speed extensible architecture with a programmable control algorithm. All functions should be fully implemented in SW, ensuring computational power. However, one of the main challenges of this HW-SW solution is the efficient mapping of this multitask application onto an embedded processing core. So a co-design methodology applying HW-SW interfaces with Real-Time (RT) SW algorithms based on an RTOS has been used in order to manage complexity and shorten design time. The paper describes a programmable SoC architecture that can support any servo drive system. A simple DC motor case has been used as an application example to validate and evaluate the adopted architecture. The outline of this paper is as follows. Section II introduces the new SoC environments. Section III presents the new challenges in digital control systems and the benefits of HW-SW implementation on SoC. Section IV describes the adopted SoC architecture and details the HW-SW design and implementation. Section V presents some experimental results for the implemented case study. Finally, Section VI provides the conclusion and future work.

II. SYSTEM ON CHIP REVOLUTION
A. New System on Chip Solutions
SoC architecture [2][3] allows integrating a full system in a single chip, avoiding external components and additionally reducing cost and complexity. This approach improves both performance and design time. Furthermore, SoC solutions offer more flexibility than other conventional digital solutions via their reconfigurable nature. The most popular SoC device is the FPGA [4].
Today, FPGA technology containing both reconfigurable logic blocks and embedded cores [5] has become quite mature for large-scale high-speed applications. It offers high performance and flexibility via a programmable SW design interacting with a reconfigurable HW one [6]. It allows embedded C applications to be accelerated and consequently ensures a higher degree of parallelism. Xilinx [7] embedded technology makes it possible to bring embedded processors onto the FPGA, such as the MicroBlaze soft-core processor or the PowerPC 405 hard-core processor. A hard-core processor is implemented in the FPGA at the transistor level, so it has dedicated silicon on the FPGA. This allows it to operate at a core frequency, and with a performance rating, similar to that of a discrete microprocessor. A soft-core processor is a microprocessor fully described in a hardware description language, usually VHDL (VHSIC Hardware Description Language), and can be synthesized onto programmable hardware. Hence, this flexible processor will not operate at the speed, or have the performance, of a hard core.

B. HW/SW Co-Design Tools

The development and integration of a system in a new SoC device is no longer a pure HW design task. Today it is also a question of system configurability, system performance, high-level SW implementation, and the right choice of platform. HW/SW co-design tools for FPGAs have therefore developed extensively in recent years and continue to do so. They open new possibilities in co-design since they provide complete development environments with on-chip system verification, on-chip logic analysers, synthesis tools, compilers and debuggers [8]. Examples of these tools are the CASTLE co-design platform [9], the Visual Spec tool [10], and Xilinx Platform Studio [7]. With the help of these tools, simplified FPGA-based computing platforms have made programming such SoCs practical and efficient.

IEEE Preprint of IECON 2009 Proceedings

III. NEW TRENDS IN DIGITAL CONTROL SYSTEMS

A. New Challenges in Embedded Control Systems

Today's industrial control products demand ever higher computing power and RT information processing within physical size limitations [11]. These demands often require a new high-performance solution. Indeed, many servomotor applications such as high-speed automation systems and vehicle control systems require higher levels of timing performance than ever before. These motor-control applications require not just high-speed performance but also more flexibility to support a variety of closed-loop control scenarios. Therefore, traditional digital controllers using standard processors such as microprocessors (µP), microcontrollers (µC) or Digital Signal Processors (DSPs) [12][13] can no longer meet the requirements of these new control applications. New reconfigurable FPGA designs suitable for high-speed applications enable the execution of complex control laws at extremely high sampling rates, and are especially suitable for electrical embedded control systems. The comparisons in [14] and [15] report that FPGA-based digital control compares favourably with DSPs and other digital devices on every criterion they consider.

B. HW-SW Implementation of Servo-Drive Control

Sophisticated control algorithms are complex code that can be divided into sub-tasks with different update rates, depending on the available bandwidth and processing priority. These control algorithms need an RT implementation: they must be coded efficiently and partitioned into HW and SW modules.
Certain time-critical tasks need to be implemented in HW, while other functions that require much slower processing and a large amount of memory should be embedded in SW using processor cores. Motor control (Fig. 1), for example, requires both torque and speed control. Torque control requires high-performance computation, whereas speed control requires relatively moderate computation speeds.

Fig. 1. Servo drive system functional overview [15]. (Figure labels: software-intensive tasks, with lots of memory, fast computation and no dedicated hardware, covering networking, sequencing, PLC and host communication, position control, speed control and torque control, on time scales from milliseconds down to microseconds; hardware-intensive tasks, with less memory, fastest computation and motion peripheral hardware, covering the power management peripheral, PWM and current sensing, on time scales from tens of nanoseconds to nanoseconds; power module.)

Consequently, the computational requirements can be divided into two segments, as shown in Fig. 1:
- a HW-rich environment for tasks that need very fast computation,
- a SW-intensive environment for tasks in which performance is less critical [15].

IV. DESIGN APPROACH AND SOC ARCHITECTURE

To highlight the performance of new reconfigurable FPGAs extended with programmable embedded cores, we have chosen to implement all control tasks as SW functions, avoiding the design of dedicated hardware blocks. The embedded RISC core runs both the control application code and dedicated internal-only code that serves RT operation. The Configurable Logic Blocks (CLBs) around the embedded processor are mapped to perform different functions such as peripheral drivers and memories. They are integrated in the design as configurable HW Intellectual Property (IP) (Fig. 2). Hence, this flexible platform opens new avenues of research for any kind of servo drive application.
Fig. 2. RT HW/SW design flow. (Figure labels: hardware flow with IP cores for microprocessor peripherals, additional components coded in VHDL, platform generation, FPGA pin assignment, data analysis, synthesis, build and map, place and route, IP netlists and bitstream; software flow with C/C++ code, compiler, object files, libraries generator, libraries, linker and executable; final bitstream file; JTAG interface, UART interface and GPIO on the SoC main board; SoC FPGA with embedded processors, IP cores and BRAMs.)

The next section, IV.A, details our HW-SW design approach.

A. Hardware Design

1. ML310 Platform and XPS Environment

The Xilinx ML310 board [16] is a complete reconfigurable design platform based on a Multi-Processor SoC (MPSoC): the Xilinx Virtex-II Pro XC2VP30 FPGA. This FPGA combines more than 30,000 logic cells and two PowerPC 405 hard cores on a single chip. The IBM CoreConnect buses available on the FPGA connect the CPU to the many peripherals of the board. They offer a variety of interfaces for designing complete MPSoC micro-architectures. The Xilinx Platform Studio (XPS) serves the need for platform-building tools [7]. It supports a wide variety of Xilinx FPGA-based boards and systems such as the ML310. Thanks to its Embedded Development Kit (EDK), XPS makes it practical for an embedded systems designer to assemble a complete system within a single FPGA. This kit provides the tools and libraries to integrate the PPC405, the MicroBlaze, and customizable peripherals. As detailed in Fig. 2, it includes a complete set of GNU-based SW tools including the compiler, assembler, debugger, and linker.

2. Designed Architecture

Fig. 3 presents the FPGA embedded system components used in our design to implement a closed-loop motor control. The architecture is based on a single processor. The system is implemented in the Virtex-II Pro FPGA device on the ML310 board. It consists of the following:
- PowerPC 405, running at 100 MHz,
- 128 KB of on-chip Block RAM (BRAM), connected to the PLB (Processor Local Bus), used for all instruction and data storage,
- RS232 serial channel, connected to an OPB (On-chip Peripheral Bus) UART peripheral, used for communication between a PC and the platform,
- General Purpose Input Output (GPIO) OPB peripheral, used for execution time measurements on a logic analyser,
- Timer/Counter OPB peripheral, used to synchronize the RT scheduling,
- Interrupt controller OPB peripheral, used to manage multiple interrupts.

The PLB provides a high-bandwidth, low-latency connection between bus agents, whereas the OPB provides a flexible connection path to peripherals and memory of various bus widths and transaction timing requirements, with minimal performance impact on the PLB [16].

Fig. 3. HW design based on PPC 405. (Figure labels: PowerPC PPC405 with DPLB and IPLB master connections on the PLB; PLB BRAM controller and BRAM; PLB-to-OPB bridge; UART, interrupt controller, timers 1 and 2, and GPIO as slave connections on the OPB.)

The PowerPC 405 hard core [17] is available in some Xilinx Virtex FPGA families. It implements the 32-bit subset of the 64-bit PowerPC architecture.
It includes a memory management unit and instruction and data caches built into the silicon of the hard processor to accelerate processing. This core provides flexibility by allowing degrees of SW compatibility across a wide range of implementations.

B. Software Design

1. Real-Time Embedded Systems

An RT system interacts with its environment by performing predefined actions in response to events within a certain time. The action for a given event is typically defined in a task, and the time allowed forms the deadline of that task [18]. An RTOS, also called an RT kernel, handles the scheduling of the SW tasks that run on a processor and thereby provides a multitasking RT system. An RTOS allows applications to be easily designed and expanded. Indeed, using an RTOS simplifies the design process by splitting the application code into separate tasks, so functions can be added without requiring major changes to the SW. Hence, multitasking allows a modular solution and increases code reuse [19][20]. With an RTOS, all time-critical events are handled as quickly and as efficiently as possible. It provides facilities for creating and scheduling several tasks within the same program, with fast task switching and unrestricted access to shared memory. An RTOS also simplifies communication and synchronisation mechanisms. Many embedded systems can benefit from the RTOS approach, in which multiple concurrent tasks communicate among themselves, all managed by a kernel with clearly defined run-time behaviour [21]. The timing properties of the RT kernel directly influence the timing properties of the embedded system. It is therefore essential that the RT kernel be thoroughly analysed and tested so that it is deterministic and predictable.

There are two main approaches to RT scheduling. On the one hand, there is off-line scheduling, where all scheduling decisions are calculated by the system designer before run-time and stored in a run-time dispatch table.
On the other hand, there is on-line scheduling, where all scheduling decisions are calculated by the scheduling algorithm at run-time. It is easier for an off-line scheduler to be optimal, since the time that such a scheduler may consume is potentially unlimited. Moreover, static scheduling, which is always planned off-line, requires that the attributes associated with the tasks (number of tasks, deadlines, priorities, periods, etc.) be known a priori. This gives good determinism, but this kind of scheduling is inflexible.

2. RTOS for High Speed Control

Digital control systems are high-speed applications that have RT constraints in the design process and need determinism and predictability in their code execution [22]. As seen in Fig. 1, a complex control system [16] involves sub-systems with different dynamics, and each parameter requires a different processing time. In order to run a fully SW multitasking control algorithm with these various processing speeds, an RTOS is needed. The most important consideration when choosing an RTOS for control applications, however, is its reliable performance. Today, highly specialized FPGA internal designs that include an embedded processor make RTOS integration in this kind of SoC effective. This feature makes the system scalable and more robust. In the SW flow, the RTOS can be structured as a library, so that the high-level user application source files can link with the RTOS to access its functionality. Both µC/OS-II and Xilkernel have been inserted in the SW design to manage the RT constraints and priorities. They have been chosen among various existing RTOSes for the features detailed below.

a. µC/OS-II

µC/OS-II [23] is a portable, ROMable, scalable, RT multitasking kernel. It is portable since it is written in ANSI C with a small portion of assembly language code to adapt it to different processor architectures, among them the PowerPC 405. It is a small RT kernel with a memory footprint of about 20 KB, which can be scaled down if the application requires fewer features. The full µC/OS-II provides various services in its processor-independent code segment. µC/OS-II multitasking is priority-based: each task is assigned a unique priority, and the kernel always runs the highest-priority task that is ready, so round-robin scheduling among equal-priority tasks is not available. Preemption is supported so that time-critical functions can be served. Moreover, µC/OS-II is free for academic use and its source code is well documented. This makes it a good candidate for our work.

b. Xilkernel

Xilkernel [7] is a small, lightweight (from 16 to 32 KB for PowerPC), modular kernel. The kernel modules needed for the application can be selected and customized in the design at hand. It is a scalable kernel that can be fitted to a given system through the inclusion or exclusion of functionality as required. This high degree of customization lets users tailor the kernel to an optimal level in terms of both size and functionality. Xilkernel is integrated with the Xilinx Platform Studio framework and is a free SW library delivered with EDK. It works with the PowerPC 405 processor. The fact that it is free and integrated in EDK makes it a good candidate for testing purposes, before porting an application to a more advanced OS.
This micro-kernel supports the standard core features required in an embedded RT kernel. It is more a thin library providing minimalist but essential services, with a POSIX subset interface, than a complete OS.

V. IMPLEMENTATION RESULTS

A. Case Study of PI Controllers for DC Motor Drive

To validate our HW-SW implementation approach, we have opted for an electrical servo motor case study using a DC motor closed-loop system. The control loop coupled with the emulated physical process is detailed in Fig. 4.

Fig. 4. Control loop for DC motor drive. (Figure labels: speed reference Ω ref, PI speed controller, current reference I ref, PI current controller, duty cycle α, inverter (DC chopper), voltage V h, emulator of motor + load, feedback signals Ω m and I m.)

The emulator should be an embedded electronic system that reproduces the operation of the physical system in RT and with high precision. It is used for validation and diagnosis of the control device [24]. These sub-systems have different sampling times and should operate at different update rates, as indicated in Table I.

TABLE I
PROCESSING SPEEDS AND PRIORITY
Task | Time period | Processing priority
Emulator | 300 µs (cache disabled), 100 µs (cache enabled) | 1
PI current controller | 300 µs | 2
PI speed controller | 20 ms | 3

The closed control loop of Fig. 4 involves two controllers: the fast current controller, unlike the slow speed controller, should be implemented in a high-priority task. Furthermore, to be closer to the real physical motor, the emulator task should be the highest-priority task, allowing the motor model to run at an optimal time period (Table I).

B. Embedded Control Validation

To check that our embedded control loop works correctly, we have plotted the rotation speed Ω m generated by the emulator at fixed time intervals. We have inserted printing functions in the SW code that send data to a computer through the UART serial device on the experimental board.
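The multi-rate structure of Fig. 4 and Table I, together with the periodic logging of the speed, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the embedded code of the paper: the motor parameters, the PI gains, the bus voltage, the current limit, the four-quadrant duty cycle range and the names `PI` and `simulate` are all hypothetical, and the three tasks are run from one loop instead of an RTOS. Only the task periods (100 µs, 300 µs, 20 ms) and the 100 rad/s reference come from the text.

```python
"""Multi-rate cascaded PI control of an emulated DC motor (illustrative sketch)."""
from dataclasses import dataclass

# Task periods from Table I (cache-enabled case), counted in emulator steps.
EMULATOR_DT = 100e-6          # emulator period: 100 us
CURRENT_EVERY = 3             # current loop: 300 us = 3 emulator steps
SPEED_EVERY = 200             # speed loop: 20 ms = 200 emulator steps

# Hypothetical motor and chopper parameters (assumptions, not from the paper).
R, L = 1.0, 10e-3             # armature resistance (ohm) and inductance (H)
KE = KT = 0.5                 # back-EMF (V.s/rad) and torque (N.m/A) constants
J, B = 0.01, 0.001            # inertia (kg.m^2) and viscous friction (N.m.s/rad)
V_DC = 100.0                  # DC bus voltage (V)
I_MAX = 10.0                  # current reference limit (A)


@dataclass
class PI:
    kp: float
    ki: float
    dt: float
    limit: float              # output is clamped to [-limit, +limit]
    integral: float = 0.0

    def step(self, error: float) -> float:
        candidate = self.integral + self.ki * error * self.dt
        u = self.kp * error + candidate
        if abs(u) <= self.limit:
            self.integral = candidate          # integrate only while unsaturated
            return u
        return max(-self.limit, min(self.limit, u))   # simple anti-windup clamp


def simulate(speed_ref: float = 100.0, duration: float = 3.0):
    """Return a list of (time, speed, current) samples taken every 20 ms."""
    speed_pi = PI(kp=0.2, ki=0.5, dt=SPEED_EVERY * EMULATOR_DT, limit=I_MAX)
    current_pi = PI(kp=0.1, ki=10.0, dt=CURRENT_EVERY * EMULATOR_DT, limit=1.0)
    i_m = omega_m = i_ref = alpha = 0.0
    log = []
    for k in range(round(duration / EMULATOR_DT)):
        if k % SPEED_EVERY == 0:               # slow outer loop: speed -> current reference
            i_ref = speed_pi.step(speed_ref - omega_m)
            log.append((k * EMULATOR_DT, omega_m, i_m))   # stands in for the UART print
        if k % CURRENT_EVERY == 0:             # fast inner loop: current -> duty cycle
            alpha = current_pi.step(i_ref - i_m)
        # Emulator: one forward-Euler step of the DC motor equations.
        v_h = alpha * V_DC
        di = (v_h - R * i_m - KE * omega_m) / L
        domega = (KT * i_m - B * omega_m) / J
        i_m += di * EMULATOR_DT
        omega_m += domega * EMULATOR_DT
    return log


if __name__ == "__main__":
    for t, omega, current in simulate()[::25]:
        print(f"t={t:5.2f} s  speed={omega:7.2f} rad/s  current={current:6.2f} A")
```

With these assumed gains the outer loop is much slower than the inner one, which is the usual reason a cascaded structure can run its two controllers at such different periods.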
We were interested in comparing the response of the PI control algorithm under the two selected RTOSes, µC/OS-II and Xilkernel. The resulting graphs are presented in Fig. 5. This test is intended to validate our case study implementation and to highlight the benefits of using the cache in our design. We note from Fig. 5 that in both cases the rotation speed reaches the fixed reference speed of 100 rad/s. However, in the current graphs of Fig. 5, the µC/OS-II case shows the best precision. Moreover, Table II shows a clear difference in execution time between the cache-enabled and cache-disabled cases: the closed loop runs faster when the cache is enabled in the system design, for both RTOSes. This confirms that enabling the cache is almost always a performance advantage for the PowerPC.
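As a rough cross-check of the timing figures, the periods of Table I and the µC/OS-II execution times of Table II can be fed into a textbook fixed-priority schedulability calculation. The sketch below is not part of the original study. It assumes that the measured execution times can stand in for worst-case execution times, that the emulator period is 300 µs with the cache disabled and 100 µs with it enabled, that deadlines equal periods, and that RTOS overhead (tick interrupt, context switches) is negligible. The function names are hypothetical.

```python
"""Rough schedulability check for the three control tasks (illustrative sketch)."""
from math import ceil

# (name, period_us, execution_time_us), from highest to lowest priority (Table I).
# Execution times are the measured uC/OS-II values of Table II, used here as if
# they were worst-case execution times, which is an assumption.
CACHE_ENABLED = [("emulator", 100.0, 22.0),
                 ("PI current", 300.0, 16.0),
                 ("PI speed", 20_000.0, 14.0)]
CACHE_DISABLED = [("emulator", 300.0, 138.0),
                  ("PI current", 300.0, 114.0),
                  ("PI speed", 20_000.0, 66.0)]


def utilization(tasks):
    """Total processor utilization: sum of execution time over period."""
    return sum(c / t for _, t, c in tasks)


def liu_layland_bound(n):
    """Sufficient (not necessary) utilization bound for n rate-monotonic tasks."""
    return n * (2 ** (1 / n) - 1)


def response_times(tasks):
    """Fixed-priority response-time analysis; stops once a deadline is exceeded."""
    results = {}
    for i, (name, period, wcet) in enumerate(tasks):
        r = wcet
        while True:
            # Interference from every higher-priority task released within r.
            interference = sum(ceil(r / t_j) * c_j for _, t_j, c_j in tasks[:i])
            r_next = wcet + interference
            if r_next == r or r_next > period:
                r = r_next
                break
            r = r_next
        results[name] = r
    return results


if __name__ == "__main__":
    for label, tasks in (("cache enabled", CACHE_ENABLED),
                         ("cache disabled", CACHE_DISABLED)):
        print(label, f"U = {utilization(tasks):.3f}", response_times(tasks))
```

Under these assumptions the utilization is about 0.27 with the cache enabled and about 0.84 with it disabled. The second value exceeds the three-task Liu and Layland bound of about 0.78, yet the response-time iteration still places every task inside its period (570 µs for the speed controller against a 20 ms period), which illustrates that the utilization bound is only a sufficient test.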

Fig. 5. Comparison between two PI-controller SW implementations. (Plots: motor current Im (A) and rotation speed (rad/s) versus measurements, for DC motor control with µC/OS-II and Xilkernel support, using cache.)

TABLE II
DIFFERENT SUBSYSTEMS TIME MEASUREMENT
Task execution time | µC/OS-II (cache disabled) | µC/OS-II (cache enabled) | Xilkernel (cache enabled)
Emulator | 138 µs | 22 µs | 22.5 µs
PI current controller | 114 µs | 16 µs | 16 µs
PI speed controller | 66 µs | 12 to 14 µs | 12 to 14.4 µs

C. Real Time Embedded Controller Performance

Several timing measurements have been conducted on the subtasks of the control algorithm, with the cache enabled, to test its speed performance. An external logic analyser has been connected to our target platform through GPIO pins (Fig. 6) to analyse the various processing parameters and to obtain high-level information about RTOS efficiency. This basic timing analysis lets us restructure the embedded controllers to be more deterministic.

Fig. 6. Connection between the board and the analyser. (Figure labels: emulator, current controller, speed controller, internal register, GPIO, external I/O.)

Besides, as in every RT kernel, the core of the two RTOSes used is configured with an interval timer using an RT interrupt clock as a HW peripheral device. This HW timer periodically interrupts the processor to invoke the scheduler and execute a tick routine. The two kernels efficiently manage multiple tasks using a priority-ordered list of waiting events. They operate in different ways, however, and offer different functionality, particularly for the delay mechanism. µC/OS-II uses the OSTimeDly() routine, which expresses a task delay as a number of ticks, whereas Xilkernel uses the sleep() routine, which expresses a task delay as a number of core time periods. If no task is running and no task is in the ready state, the idle task executes.
The idle task is always the lowest-priority task and cannot be deleted or suspended by user tasks [23]. The chronograms of Fig. 7 and Fig. 8 show exactly what happens during code execution and which portion of the code consumes the most processing time.

Fig. 7. Motor control with µC/OS-II support. (Legend: SCHEDU: scheduler; ISR: Interrupt Service Routine.)

Fig. 8. Motor control with Xilkernel support. (Legend: SWITCH: context switch.)

Fig. 7 and Fig. 8 clearly show that the fixed-priority scheduling policy (Table I) is respected and that the system behaves periodically with both RTOSes. In terms of average-case context-switch time, however, the µC/OS-II system performs better than the Xilkernel one. Table II shows the high-speed performance of our embedded control algorithm with the two RTOS supports: with the cache enabled, the execution times of the algorithm subtasks are in the range of tens of microseconds.

D. Speed/Area Tradeoff

The design summary given in Table III shows the low HW cost of our design implemented on the XC2VP30ff896-6 FPGA device (about 15% of the total on-chip resources). This leaves room to add several custom HW IPs for dedicated hardware blocks such as Pulse Width Modulation, or to map the time-critical current task onto HW. Furthermore, the FPGA used incorporates a storage hierarchy to balance performance demands against resource cost: off-chip storage is the least costly but provides the slowest access, whereas internal storage, such as on-chip memory (BRAMs) or on-processor memory (caches), provides increasingly fast access. In order to record the parameter values of the closed control loop during a defined time period, some storage space has been taken from the free space of the BRAMs used. Its size depends on how many parameters are selected for recording and on the number of samples required. This has consequently increased the number of BRAMs used (about 23% of the total number, Table III). We deduce from this design summary that this FPGA device offers free on-chip storage allowing the implementation of a more complex software control algorithm.

TABLE III
DEVICE UTILIZATION SUMMARY
Embedded processor core | PowerPC405
Number of used BRAMs | 32 out of 136: 23%
Number of occupied slices | 1,416 out of 13,696: 10%
Total equivalent gate count | 4,253,390 out of 30,000,000: 14%

VI. CONCLUSION AND FUTURE WORK

This paper has detailed the effectiveness of new FPGA technology for embedded control. FPGAs with on-chip embedded processors are attractive for HW-SW co-design: they are appropriate solutions for boosting controller performance while ensuring high flexibility. Experimental results for the DC motor control case have shown that enabling the cache on the hard-core PPC405 enhances performance for a fully software control algorithm. The designed system consumes few on-chip resources and can easily be adapted to more complex motor drive case studies. The changes needed for adaptation are minor and only involve the embedded software code of the target platform core. By switching to software, we remove the need to design a dedicated hardware block for a task, which saves design and verification time and reduces chip area. Two RTOSes have been integrated in the software design to support the embedded control application. Both have met the RT challenges posed by a standard control application with multiple concurrent tasks and interrupt sources. Xilkernel is simpler to implement, but µC/OS-II offers more RT functionality and determinism. In general, this RTOS support allows time-critical events to be handled efficiently, and it ensures a high degree of determinism and modularity.
The full flexibility of FPGA design offers the opportunity to make critical functions (such as the current controller) faster by moving them to the HW domain. The FPGA provides a high-bandwidth data transfer mechanism via the PLB of the PPC405 core. This flexibility could significantly increase the speed of the whole control system.

REFERENCES
[1] E. Monmasson and M. N. Cirstea, "FPGA Design Methodology for Industrial Control Systems: A Review," IEEE Trans. on Industrial Electronics, vol. 54, no. 4, Aug.
[2] K. Eshraghian, "SoC Emerging Technologies," IEEE Proc. on JPROC, vol. 94, no. 6, Jun. 2006.
[3] S. Pasricha and N. D. Dutt, "Trends in Emerging On-Chip Interconnect Technologies," IPSJ Trans. on System LSI Design Methodology, Sep. (Invited Paper).
[4] M. Dominique, "FPGA de la «glue logic» au System On Programmable Chip," Revue 3EI, vol. 14, no. 53, pp. 5-10, Jun.
[5] D. A. Finkelstein and H. Hadimioglu, "A Scalable Processor with Embedded Software for Large-Scale Scientific Applications," 2nd Workshop on ARFP, Austin, Texas.
[6] A. K. Ben Salem, S. Ben Othman and S. Ben Saoud, "Hard and Soft-Core Implementation of Embedded Control Application Using RTOS," IEEE Proc. ISIE, Cambridge, UK, Jun. 2008.
[7] Xilinx Inc., Website.
[8] F. Vahid and T. Givargis, "Highly-Cited Ideas in System Codesign and Synthesis," International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Atlanta, Oct. 2008.
[9] M. Theissinger, P. Stravers, and H. Veit, "CASTLE: an Interactive Environment for HW-SW Co-design," in Proc. Hardware/Software Codesign, Sep. 1994.
[10] D. Araki, T. Ishii, and D. D. Gajski, "Rapid Prototyping with HW/SW Codesign Tool," in IEEE Proc. Engineering of Computer-Based Systems (ECBS), Nashville, USA, Mar. 1999.
[11] S. Ben Saoud, A. Gerstlauer, and D. D. Gajski, "Codesign Methodology of Real-time Embedded Controllers for Electromechanical Systems," American Journal of Applied Sciences, no. 2, Sep.
[12] D. He and R. M. Nelms, "Fuzzy Logic Average Current Mode Control for DC/DC Converters Using an Inexpensive 8-bit Microcontroller," IEEE Conf. on Industry Applications, Oct. 2004.
[13] D. Hadiouche, L. Baghli, and A. Rezoug, "Space Vector PWM Techniques for Dual Three-Phase AC Machine: Analysis, Performance Evaluation, and DSP Implementation," IEEE Trans. on Industry Applications, vol. 42, no. 4, Jul.-Aug.
[14] A. Fratta, G. Griffero, and S. Nieddu, "Comparative Analysis Among DSP and FPGA-based Control Capabilities in PWM Power Converters," in Proc. IEEE IECON, Busan, Korea, Nov. 2004.
[15] J. Bhoot and T. Takahashi, "Platform Delivers Fast, Flexible AC Servomotor-Control Designs," Xcell Journal, Summer.
[16] Xilinx Inc., Virtex-II Platform FPGA User Guide, version 4.0.
[17] Xilinx Inc., PowerPC Processor Reference Guide, UG011 (v1.2), Jan.
[18] A. Gambier, "Real-time Control Systems: a Tutorial," in Proc. Asian Control Conference, Melbourne, Australia, Jul.
[19] R. Yerraballi, "Real-Time Operating Systems: An Ongoing Review," in IEEE Proc. Real-Time Systems Symposium, WIP Section, Orlando, FL, Oct.
[20] M. Jacomet, J. Goette, J. Breitenstein, and M. Hager, "On a Development Environment for Real-Time Information Processing in System-On-Chip Solutions," in Proc. Integrated Circuits and Systems Design, Pirenopolis, Brazil, Sep. 2001.
[21] F. Engel, G. Heiser, I. Kuz, S. M. Petters and S. Ruocco, "Operating Systems on SoCs: A Good Idea?" in IEEE Proc. ERTSI Workshop, Lisbon, Portugal, Dec.
[22] B. Shao, R. Wang, Beijing, "Embedded RT Systems To be Applied in Control Subsystems For Accelerators," The International Workshop on PCs and Particle Accelerator Controls, KEK Tsukuba, Japan, Jan.
[23] J. J. Labrosse, MicroC/OS-II: The Real-Time Kernel. CMP Books.
[24] S. Ben Saoud, D. D. Gajski, "Co-design of Emulators for Power Electric Processes Using SpecC Methodology," UC Irvine, Technical Report ICS-TR-01-46, Jul.

Impact of real-time enhancements on the system performances for multi-core Intel architectures

Nabil LITAYEM, Mohamed Aymen SIALA, Ahmed BEN ACHBALLAH, Slim BEN SAOUD
LECAP-EPT-INSAT, BP 676, 1080 Tunis Cedex, Tunisia

Abstract: Linux real-time extensions have become a dominant choice in modern control applications. Many extensions currently exist to offer real-time performance for the Linux kernel. In this paper we study the impact of real-time enhancements on system performance for a multi-core Intel architecture. The current trend is to modify the internal behaviour of the Linux kernel to offer real-time performance. The best-known example of this approach is the PREEMPT-RT patch, which is currently being mainlined into the kernel. Two metrics are used to reflect the performance of a real-time computer system: latency and computation power. Both are important in embedded control applications, and a careful choice must be made to select the kernel best suited to a given hardware architecture in order to meet the needs of modern control applications. Using different Linux kernels, we try to show the impact of latency improvement on system performance on a recent multi-core Intel architecture.

Index Terms: Real-time systems, Linux, PREEMPT-RT, control applications.

I. INTRODUCTION

Modern control applications require many newer functionalities such as a GUI (graphical user interface), communication capabilities and a high degree of software reusability. A classic RTOS (Real-Time Operating System) can deliver the timing performance but has many weaknesses concerning the other required aspects. On the other hand, the Linux kernel satisfies these other requirements, but it cannot be used as a hard real-time system. These reasons have prompted many initiatives to integrate real-time performance into the Linux kernel. Many approaches currently exist that offer this functionality using different architectures [1].
The best-known solutions are RTAI, Xenomai and the PREEMPT-RT patch [2]; each of these approaches has its own internal architecture, strengths and weaknesses. The wide choice in terms of timing, performance and functionality among the different Linux kernel variants makes Linux one of the most suitable embedded operating systems, widely adopted for embedded applications with different types of constraints. PREEMPT-RT is currently the most successful approach in the industrial field and is finally being merged into the mainline kernel. This patch is also known to offer good real-time performance. Other successful real-time projects are trying to converge with PREEMPT-RT, in particular Xenomai [3] with its Xenomai/Solo approach, which ports Xenomai capabilities to PREEMPT-RT.

On the other hand, embedded processing requirements are increasing at an exponential rate, and the supply of embedded processors is becoming increasingly broad. Different platforms can be adopted in the embedded field; classically, FPGA and DSP architectures are widely used for high-performance embedded applications. We are currently witnessing the convergence of PC and embedded architectures. Several conventional microprocessor vendors are trying to extend their activities with processors that can be used in both standard and embedded computers. Intel with its Atom™ processor and AMD with its Geode™ processor are considered interesting candidates in the embedded field. Other high-end processors designed for desktop and server systems are being adopted in industrial computers for embedded control.

In this paper, we investigate the usability of industrial computers for real-time control applications using the PREEMPT-RT patch. To this end, we evaluate the real-time performance of the PREEMPT-RT kernel using the cyclictest real-time benchmark. We then compare the results with those obtained from the server and desktop kernel variants.
We then evaluate system performance using UnixBench for the same Linux kernels. This approach is adopted to study the timing performance of Linux and its impact on system performance for both single-core and dual-core systems. The studied platform is a standard computer with an Intel Core 2 Duo microprocessor and 3 GB of DDR2 RAM, running Ubuntu Linux. This platform is similar to high-end industrial computers designed for control purposes.

This paper is organized as follows. Section II presents a survey of the dominant open-source real-time Linux solutions. Section III studies the latency of different kernel versions using the cyclictest benchmark. Section IV presents the system performance evaluation for both single-core and dual-core systems. Conclusions and discussion are given in Section V.

II. REAL-TIME LINUX EXTENSIONS

A. Real-time Linux

Due to the increasing popularity of Linux in the embedded systems field, many efforts have been made to transform the Linux kernel into a real-time kernel. This work has resulted in several implementations of real-time Linux [4]. These extensions can be classified in two categories according to the approach used to improve the real-time performance of the Linux kernel. The first approach consists of modifying the kernel behaviour to improve its real-time characteristics. The second approach consists of using a small real-time kernel to handle real-time tasks, which runs the Linux kernel as a low-priority task. Many research and industrial efforts are currently being made to enhance the real-time capability of the different real-time Linux flavours [5] from different perspectives and for different application domains. This work can be classified in two categories. The first concerns scheduling algorithms and timer management. The second concerns application domains such as Hardware-in-the-Loop simulation [6], model-based engineering [7] and real-time simulation. In the following, we present the different open-source real-time extensions and the techniques they use to bring real-time behaviour to the Linux kernel. In the last few years, three real-time extensions have been widely adopted; they are presented below.

B. RTAI

RTAI (Real-Time Application Interface) [8] is usable on both uni-processors and symmetric multi-processors (SMPs), and allows the use of Linux in many "hard real-time" applications. As an option, RTAI's "LXRT" allows the control of real-time tasks, using all of RTAI's hard real-time system calls, from within Linux memory-protected user space, resulting in soft real-time combined with fine-grained task scheduling.
RTAI is the real-time Linux extension with the best integration with other open source tools such as Scilab/Scicos [9], and it is widely used in control applications.

C. PREEMPT-RT

The PREEMPT-RT patch converts Linux into a fully preemptible kernel. It allows nearly the entire kernel to be preempted, with the exception of a few very small regions of code. This is done by replacing most kernel spinlocks with preemptible mutexes that support priority inheritance, and by moving all interrupts to kernel threads (so-called interrupt threading), which gives them their own context and allows them, among other things, to sleep.

D. Xenomai

Xenomai is a real-time development framework that provides hard real-time support for GNU/Linux. It places the ADEOS (I-Pipe) micro-kernel between the hardware and the Linux kernel. I-Pipe is responsible for executing real-time tasks; it intercepts interrupts and blocks them from reaching the Linux kernel, so that real-time tasks cannot be preempted by the Linux kernel. Xenomai provides real-time interfaces both to kernel-space modules and to user-space applications. These interfaces include RTOS interfaces (pSOS+, VRTX, VxWorks and RTAI), standardized interfaces (POSIX, µITRON) and new interfaces designed with the help of RTAI (native interface). This feature has earned Xenomai the reputation of being the RTOS chameleon for Linux. It was designed to enable a smooth migration from a traditional RTOS to Linux without having to rewrite the entire application.

III. TESTING REAL-TIME CHARACTERISTICS

A. Real-time benchmarks

Several benchmarks exist to reflect real-time performance; each one has its own approach and focuses on a specific aspect. The best-known real-time benchmarks are lpptest, cyclictest, LRTB, Hourglass, the Woerner test and the Senoner test. Cyclictest measures the latency of a given system as the difference between the configured timer expiration time and the time at which the timer actually expires.

B. Evaluating the latency of different Linux kernels using cyclictest

The cyclictest benchmark can be run with different parameters, either to record the latency of every sample or to report only the average and maximum latency. In our case we used the verbose mode to study the latency statistically and the silent mode to determine the average and maximum latency.

1. Statistical latency study

To study the latency of different Linux kernels, we ran the cyclictest benchmark on three domain-specific kernels: server, desktop and PREEMPT-RT. The results for these kernel versions were plotted using Matlab to show statistically the response time of the samples. Figures 1, 2 and 3 illustrate the latency of the server, desktop and PREEMPT-RT kernels. The three figures show the latency in µs (X axis) for a given sample (Y axis).

Figure 1. Latency of the server kernel (µs)
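The average and maximum latency reported in silent mode can also be recovered from the per-sample values collected in verbose mode. The following Python sketch shows one way to do this. It assumes that each verbose line has the form `thread: cycle: latency`, with the latency in µs, which should be checked against the cyclictest version in use. The function names and the sample values are hypothetical and are not measurements from this study.

```python
import re
from statistics import mean

# Assumed layout of a verbose line: "<thread>: <cycle>: <latency in µs>".
SAMPLE_RE = re.compile(r"^\s*(\d+):\s*(\d+):\s*(-?\d+)\s*$")


def parse_latencies(lines):
    """Return the latency values (µs) found in cyclictest-style verbose lines."""
    latencies = []
    for line in lines:
        match = SAMPLE_RE.match(line)
        if match:  # headers, comments and blank lines are skipped
            latencies.append(int(match.group(3)))
    return latencies


def summarize(latencies):
    """Compute the sample count, average latency and maximum latency."""
    if not latencies:
        raise ValueError("no latency samples found")
    return {
        "samples": len(latencies),
        "avg_us": mean(latencies),
        "max_us": max(latencies),
    }


# Hypothetical samples, for illustration only (not measured data).
example_lines = [
    "# comment line, ignored by the parser",
    "       0:       0:      12",
    "       0:       1:       9",
    "       0:       2:      15",
]

if __name__ == "__main__":
    print(summarize(parse_latencies(example_lines)))
```

Keeping the raw samples, rather than only the two summary values, also makes it possible to plot the latency distribution of each kernel as in Figures 1 to 3.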

Figure 2. Latency of the desktop kernel (µs)

Figure 3. Latency of the PREEMPT-RT kernel (µs)

2. Maximum and average response time study

The PREEMPT-RT patch introduces new operating system enhancements to minimize both the maximum and the average response time of the Linux kernel. These enhancements are gradually being added to the Linux kernel to offer real-time capabilities. The most important ones are:
- High resolution timers
- Complete kernel preemption
- Interrupt management as threads
- Hard and soft IRQs as threads
- Priority inheritance mechanism

Some of these new features, such as threaded IRQs, are currently being pushed to the mainline kernel by the patch maintainers. To evaluate the impact of these features on maximum and average latency, we measured these values for the three studied kernels using the cyclictest benchmark. The results are shown in Figures 4 and 5.

Figure 4. Average cyclictest latency for the different Linux kernels (generic-pae, generic, rt)

Figure 5. Maximum cyclictest latency for the different Linux kernels

C. Interpretations

Comparing the cyclictest latencies of the three Linux kernels, we observe that the server and desktop kernels are comparable, with a slight advantage for the server kernel on the average values. However, the server kernel shows jitter reaching 780 µs. The PREEMPT-RT kernel, for its part, has good latency characteristics both in the average and in the maximum values.

IV. SYSTEM PERFORMANCE

A. System performance evaluation and benchmarking

Advances in computer technology introduce new requirements and constraints for system performance evaluation, especially with the emergence of multi-core architectures. System performance testing is very helpful in the design flow since it gives a picture of the whole system, including all its aspects.
Several system benchmark suites exist; they can be classified as follows:
- CPU benchmarks
- Embedded and media benchmarks
- Language-specific benchmarks
- Transaction processing benchmarks
- Web server benchmarks
- Domain-specific benchmarks

Every benchmark from these categories should be representative of the applications that can run on the studied systems. The categories and their relevant benchmarks are detailed in [10].

Today, with the convergence of desktop and embedded systems, general system benchmarks can be used. The most useful benchmarks in our case are LMBench, UnixBench and Nbench. We adopted UnixBench for the rest of our work because it has been updated to support multiprocessor systems and because it is highly portable across different UNIX systems.

B. UnixBench

UnixBench is designed to provide a basic performance indicator of a UNIX system. Various aspects of the system are each reflected by an index that compares the performance of the current system to that of a reference system. The entire set of index values is then combined into an overall index for the system. UnixBench can also handle multi-CPU systems. The advantage of this benchmark in our study is its ability to reflect the performance of the overall system (including the operating system and the compiler used) and not only that of the available hardware, which matches the situation of real systems. The individual performance reports indicate the performance of the system in a specific domain such as integer or floating-point computation. The system benchmark score indicates the performance of the system as a whole. In both reports, a higher score indicates better performance.

C. System performance using a single processor core

UnixBench is able to detect the CPUs available on the studied system and to run its benchmarks in parallel on these CPUs. The reported values are given for both the single-core and the multi-core configuration. Figures 6 and 7 show respectively the individual performance and the score for the single-core system.

Figure 6. UnixBench individual performance for single-core architecture (kernels: generic-pae, generic, rt; tests: Dhrystone 2 using register variables, Double-Precision Whetstone, Execl Throughput, File Copy 1024 bufsize 2000 maxblocks, File Copy 256 bufsize 500 maxblocks, File Copy 4096 bufsize 8000 maxblocks, Pipe Throughput, Pipe-based Context Switching, Process Creation, Shell Scripts (1 concurrent), Shell Scripts (8 concurrent), System Call Overhead)

Figure 7. System benchmark scores for the different studied kernels running on a single core (System Benchmarks Index Score for generic-pae, generic and rt)

D. System performance using two cores

When executed on a multi-core system, UnixBench reports the performance offered by that system. Since we studied a Core 2 Duo architecture, UnixBench reports the performance of both the single-core and the dual-core configuration. Figures 8 and 9 report respectively the individual performance and the score of the dual-core system.

Figure 8. UnixBench individual performance for dual-core architecture (same kernels and tests as Figure 6)
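The index mechanism described in Section IV.B can be illustrated with a short Python sketch. It assumes the usual UnixBench convention, in which each test result is divided by the result of the reference system and multiplied by 10, and in which the overall index is the geometric mean of the individual indices. The function names and the numeric values are hypothetical and serve only as an illustration.

```python
import math


def index_score(result, baseline):
    """Index of one test: 10 means 'as fast as the reference system'."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 10.0 * result / baseline


def overall_index(indices):
    """Geometric mean of the individual indices."""
    if not indices:
        raise ValueError("at least one index is required")
    return math.exp(sum(math.log(value) for value in indices) / len(indices))


if __name__ == "__main__":
    # Hypothetical (result, baseline) pairs, not values measured in this study.
    runs = [(200.0, 100.0), (50.0, 100.0), (400.0, 100.0)]
    indices = [index_score(result, baseline) for result, baseline in runs]
    print(indices, overall_index(indices))
```

A geometric mean prevents a single test with a very large index from dominating the overall score, which is why a kernel that is slower on most tests cannot compensate with one outstanding result.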

Figure 9. System benchmark scores for the different studied kernels running on a dual-core system (System Benchmarks Index Score)

E. Interpretations

The values measured on the single-core architecture show an overall performance degradation of about 16% caused by the real-time capabilities, compared to the standard Linux kernel. On the other hand, the dual-core architecture shows a considerable degradation for the PREEMPT-RT patch, of about 35% compared to the standard Linux kernel. Moreover, we observe that the performance of the dual-core architecture is 34% higher than that of the single-core architecture. A wiser choice may therefore be to adopt a higher-performance single-core processor instead of a dual-core one. These results can be explained by the current level of maturity of PREEMPT-RT multi-core support.

V. CONCLUSIONS

This paper presented a study of the latency enhancements introduced by the PREEMPT-RT patch and of their impact on system performance. We can conclude that the considerable latency improvement offered by the PREEMPT-RT patch can have a negative impact on the overall performance of the computer system. The degradation is especially large on the dual-core system, where the performance obtained with the PREEMPT-RT patch is about half of that offered by the server kernel. This can be explained by the current level of maturity of the PREEMPT-RT patch on such architectures. As a follow-up to this work, we plan to investigate the performance offered by other real-time extensions such as Xenomai and RTAI. The concept of domain offered by these two alternatives can be explored to minimize the overhead of the real-time extension by assigning real-time and non-real-time tasks to different processor cores.

VI. REFERENCES

[1] K. Yaghmour, G. Ben-Yossef, and P. Gerum, Building Embedded Linux Systems, O'Reilly
[2] M. Mossige, P. Sampath, and R. Rao, Evaluation of Linux rt-preempt for embedded industrial devices for Automation and Power technologies - A case study, in Proceedings of the Ninth Real-Time Linux Workshop, Linz, November
[3] P. Gerum, Xenomai - Implementing a RTOS emulation framework on GNU/Linux, 2004
[4] M. Tim Jones, Anatomy of real-time Linux architectures - From soft to hard real-time, IBM, 15 Apr 2008
[5] Z. Chen, X. Luo, Z. Zhang, Research Reform on Embedded Linux's Hard Real-time Capability in Application, Embedded Software and Systems Symposia, ICESS Symposia '08, International Conference on, July 2008
[6] F. Jiang, S. Gao, Jie Zhang, A Hardware-in-the-loop Simulation System of Diesel Engine Based on Linux RTAI, Power and Energy Engineering Conference, APPEEC, Asia-Pacific, March 2009, pp. 1-4
[7] G. Doukas and K. Thramboulidis, A Real-Time Linux Based Framework for Model-Driven Engineering in Control and Automation, Industrial Electronics, IEEE Transactions on, accepted for future publication, Volume PP, Forthcoming, 2009, pp. 1-11
[8] G. Doukas, A. Brusaferri, M. Colla, K. Thramboulidis, RTAI-based execution environments for function block based control applications, Emerging Technologies & Factory Automation, ETFA, IEEE Conference on, Sept.
[9] R. Bucher, S. Balemi, Scilab/Scicos and Linux RTAI - a unified approach, Control Applications, CCA, Proceedings of 2005 IEEE Conference on, Aug.
[10] L. Kurian John, L. Eeckhout, Performance Evaluation and Benchmarking, CRC Press, 2006

Study of the Influence of Processor-Memory Communication on the Performance of a XILINX FPGA-Based SoC

Nabil LITAYEM, Meftah GHRISSI, Bilel FITOURI, Slim BEN SAOUD
Institut National des Sciences Appliquées et de Technologies (INSAT), Centre Urbain Nord, BP Tunis Cedex
LECAP - EPT

Keywords: soft-core processor, hard-core processor, processor-memory communication, FPGA, embedded systems, benchmark.

Abstract: The requirements of embedded processing are growing at an exponential rate, and the range of embedded processors on offer keeps widening, giving designers a multitude of choices; the selection criteria are also very varied (performance, cost, power consumption, associated tools, etc.). FPGA platforms currently offer two categories of embedded processors: soft-core processors, supplied as HDL source code or as a netlist, and hard-core processors, implemented in silicon on several FPGA platforms. Each processor comes with a set of mechanisms for connecting it to its memory. In this paper we study the influence of processor-memory communication on the performance of a SoC, using two embedded processors, MicroBlaze and PowerPC, on a XILINX Virtex-II Pro FPGA platform.

I. INTRODUCTION

The performance of embedded systems is one of the critical aspects that must be taken into account during the design phase [1], [2]. Software solutions such as audio/video applications, encoding/decoding, image processing or distributed applications require both precision and a short execution time. Increasing integration density has made it possible to integrate complete systems, called SoCs, on a single chip [3].
Processor-memory communication, as in any computer system, is a key point that affects the overall performance of a SoC. In this paper we present a comparative study of processor-memory communication mechanisms for the two processors MicroBlaze [4] and PowerPC 405 [5], in order to reveal the influence of these mechanisms on the overall performance of a SoC built on a XILINX FPGA platform. This paper is organized as follows. Section 2 presents the hardware platform used and the processors studied. Section 3 then presents the results obtained with the two benchmarks executed on the studied architectures. A discussion is given in Section 4. Finally, a conclusion and some perspectives are given in Section 5.

II. PRESENTATION OF THE HARDWARE PLATFORM AND THE PROCESSORS USED

A. Hardware platform

The hardware platform used is the XILINX ML310 board shown in Figure 1. It is built around a Virtex-II Pro FPGA [6], which can include one or two IBM PowerPC processors clocked at 400 MHz (depending on the device range) [5]. On the ML310, the Virtex-II Pro includes two PowerPC processors; the board is also equipped with a wide range of peripherals (PCI interface, Compact Flash, DDR memory, IDE interfaces, Ethernet interface, etc.).

Figure 1. XILINX Virtex-II Pro FPGA platform

B. PowerPC 405 hard-core processor

The PowerPC 405 processor shown in Figure 2 is a 32-bit RISC processor from IBM, mainly intended for embedded applications. On Virtex-II Pro FPGA platforms this processor is implemented in silicon, but it can be used in combination with other hardware modules designed and synthesized on the FPGA.

Figure 2. Architecture of the PowerPC processor

C. MicroBlaze soft-core processor

The MicroBlaze soft-core processor [7] shown in Figure 3 is a RISC processor optimized for implementation on XILINX FPGAs. It is highly configurable [8]. Its configuration is a very simple task, since it is done within the EDK development environment and lets the user set its parameters when designing the logic schematic. It occupies between 900 and 2600 logic cells (LUTs¹) and reaches a frequency of 80 to 180 MHz depending on the platform (Spartan, Virtex, ...) and the selected options.
The simplicity of its configuration and its small silicon area make this processor a very attractive choice compared with other soft cores (LEON3 [9], OpenRISC [10]), particularly on XILINX platforms.

Figure 3. Architecture of the MicroBlaze processor (blocks: instruction-side bus interface with IXCL_M, IXCL_S, IOPB and ILMB; instruction cache; bus interface; program counter; instruction buffer; instruction decode; special purpose registers; ALU, shift and barrel shift units; 32 x 32-bit register file; data cache; data-side bus interface with DXCL_M, DXCL_S, DOPB and DLMB; MFSL 0..7 and SFSL 0..7 links)

¹ Look-Up Table

III. PERFORMANCE EVALUATION

Benchmarks [12] are among the most effective means of evaluating the performance [11] of a microprocessor-based system. A benchmark is an application that reflects the performance of a microprocessor, and of the compiler used, in a well-defined domain. In what follows we use the two benchmarks Stanford and Dhrystone to reflect the performance of the different architectures under study; for all of them the compiler used is GCC. The studied architectures are different processor-memory combinations built with the various XILINX memory controllers [13], [14], [15], [16].

A. Performance evaluation with the Dhrystone benchmark

1. Presentation of Dhrystone

Dhrystone is a synthetic benchmark that reflects the integer computation performance of a microprocessor-based system. It is written in C, which makes it highly portable across architectures. Its memory footprint is very small, which makes it unsuitable for current PC architectures since it does not reflect the influence of a large memory; it nevertheless remains well suited to embedded systems.

2. Results obtained with Dhrystone

To evaluate the performance of the different memory architectures for the PowerPC and MicroBlaze processors, we ran the benchmark on the architectures listed in Table 1. In the architectures that use a cache, the cache size is 16 KB for both processors. Note, however, that the MicroBlaze cache size is configurable and can reach 64 KB, whereas for the PowerPC it is fixed at 16 KB.

Table 1. Results obtained with the Dhrystone benchmark

Processor    Memory   Cache   DMIPS
PowerPC      PLB      No      22.9
PowerPC      PLB      Yes     114
PowerPC      OCM      No      84.4
MicroBlaze   LMB      No      91.4
MicroBlaze   OPB      No      15.7
MicroBlaze   SDRAM    No      4.57
MicroBlaze   SDRAM    Yes     [value missing]

3. Analysis of the results

Running the Dhrystone benchmark leads to the following conclusions:
a) The performance reached by the PowerPC and the MicroBlaze is comparable when internal memories are used (OCM for the PowerPC and LMB for the MicroBlaze).
b) The best performance is obtained with the PowerPC architecture using PLB memory with 16 KB of cache. This architecture shows that a large performance gain is obtained by using a cache (114 DMIPS against 22.9 DMIPS, about a fivefold improvement).
c) Although the MicroBlaze is a soft core, it can reach high performance comparable to that of the PowerPC processor at the same operating frequency.
d) The performance loss caused by the use of an external memory can be compensated by using a cache.

B. Performance evaluation with the STANFORD benchmark

1. Presentation of STANFORD

STANFORD is a small benchmark suite made up of the following programs:
Perm: a recursive permutation program.

Towers: a program solving the Towers of Hanoi problem.
Queens: a program solving the eight queens problem 50 times.
Intmm: a program multiplying two integer matrices.
Mm: a program multiplying two floating-point matrices.
Puzzle: a compute-bound program.
Quick: a program sorting an array with the Quicksort algorithm.
Bubble: a program sorting an array with the Bubble sort algorithm.
Tree: a program sorting an array with the Tree sort algorithm.
FFT: a program computing the fast Fourier transform.

STANFORD measures the execution time, in milliseconds, of each of the small programs in the suite. Two weighted sums are computed: the first reflects the execution time of the fixed-point programs and the second the execution time of the floating-point programs. The coefficients of the weighted sums are predefined experimentally.

2. Results obtained with Stanford

The same architectures as before, except the PowerPC architecture with an OCM controller, were evaluated with the Stanford benchmark. The OCM architecture was excluded because Stanford cannot be loaded into a memory of this type, given its large size compared with Dhrystone. The following table shows the results obtained with this benchmark.

Table 2. Results obtained with the Stanford benchmark (columns: Processor, Memory, Cache; execution time per function for Perm, Towers, Queens, Intmm, Mm, Puzzle, Quick, Bubble, Tree and FFT; cumulative scores NFPC and FPC. Rows: PowerPC with PLB memory, without and with cache; MicroBlaze with LMB memory without cache, OPB memory without cache, and SDRAM memory without and with cache)

3. Analysis of the results

a) Comparing the PowerPC architecture using PLB memory with cache and the MicroBlaze architecture using LMB memory, we see that they obtain comparable results for simple algorithms such as Perm and Towers. However, the gaps are large for complex algorithms involving floating-point operations, such as Mm and FFT.
b) For the MicroBlaze architectures using an external SDRAM memory, the use of a cache improves performance and brings the results close to those obtained with LMB memory, particularly for algorithms using floating-point operations. This leads us to conclude that the performance limitation of these architectures is tied to the performance of the processor core and not to the communication with memory.
c) LMB memories, which have a low latency, perform as well as the data and instruction caches used with the PowerPC processor. This result also appears in the Dhrystone runs and shows that a SoC integrating a MicroBlaze can sometimes outperform one integrating a PowerPC hard core.
d) The use of an external memory is not recommended for systems that must be high-performing, given the high number of cycles consumed by each read/write access.
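Returning to the Dhrystone figures of Table 1, the weight of the memory path can be made explicit by expressing each configuration relative to another. The Python sketch below does this with the DMIPS values reported in Table 1; the MicroBlaze SDRAM configuration with cache is left out because its value is not available here. The dictionary layout and the helper name `ratio` are illustrative assumptions.

```python
# DMIPS values from Table 1, keyed by (processor, memory, cache enabled).
dmips = {
    ("PowerPC", "PLB", False): 22.9,
    ("PowerPC", "PLB", True): 114.0,
    ("PowerPC", "OCM", False): 84.4,
    ("MicroBlaze", "LMB", False): 91.4,
    ("MicroBlaze", "OPB", False): 15.7,
    ("MicroBlaze", "SDRAM", False): 4.57,
}


def ratio(config_a, config_b):
    """Return how many times faster config_a is than config_b."""
    return dmips[config_a] / dmips[config_b]


# Gain brought by the cache on the PowerPC with PLB memory.
cache_gain = ratio(("PowerPC", "PLB", True), ("PowerPC", "PLB", False))

# Internal LMB memory against uncached external SDRAM on the MicroBlaze.
lmb_vs_sdram = ratio(("MicroBlaze", "LMB", False), ("MicroBlaze", "SDRAM", False))

# Internal memories of the two processors: LMB (MicroBlaze) against OCM (PowerPC).
lmb_vs_ocm = ratio(("MicroBlaze", "LMB", False), ("PowerPC", "OCM", False))

if __name__ == "__main__":
    print(f"PLB cache gain: {cache_gain:.2f}x")
    print(f"LMB vs SDRAM:   {lmb_vs_sdram:.2f}x")
    print(f"LMB vs OCM:     {lmb_vs_ocm:.2f}x")
```

With these figures the cache brings roughly a fivefold gain on the PLB path, and the LMB memory is about twenty times faster than uncached SDRAM, which is consistent with the recommendation against uncached external memory.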

IV. CONCLUSION AND PERSPECTIVES

This work allowed us to evaluate closely the performance achievable by a system on chip implemented on a Virtex-II Pro FPGA. We found that, besides the choice of processor, the memory also plays a major role in the performance that a SoC can reach. We also found that the performance achievable by a soft core is comparable to that of a hard core, particularly when the memory is chosen appropriately. This work should, however, be completed by a study of the energy consumption of the chosen systems [18]. The designer will then be able to choose the architecture wisely, taking into account both the performance of the system and its consumption.

References

[1] Wayne Wolf, High-Performance Embedded Computing, Elsevier, 2007
[2] Lizy Kurian John, Lieven Eeckhout, Performance Evaluation and Benchmarking, CRC Press, 2006
[3] Ahmed Amine Jerraya, Sungjoo Yoo, Diederik Verkest, Norbert Wehn, Embedded Software for SoC, Kluwer Academic Publishers, 2003
[4] XILINX, MicroBlaze Processor Reference Guide, Embedded Development Kit
[5] XILINX, PowerPC 405 Processor Block Reference Guide, Embedded Development Kit, June 5
[6] XILINX, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, November 5
[7] Steven J. E. Wilton, Noha Kafafi, James C. H. Wu, Kimberly A. Bozman, Victor O. Aken'Ova, and Resve Saleh, Design Considerations for Soft Embedded Programmable Logic Cores
[8] Ludovic L'Hours, Generating Efficient Custom FPGA Soft-Cores for Control-Dominated Applications, Proceedings of the 16th International Conference on Application-Specific Systems, Architecture and Processors (IEEE ASAP 05)
[9]
[10]
[11] Vittorio Cortellessa, Pierluigi Pierini, and Daniele Rossi, Integrating Software Models and Platform Models for Performance Analysis, IEEE Transactions on Software Engineering, vol. 33, no. 6, June 2007
[12] Yingxu Wang and Hareton K. N. Leung, Benchmark-Based Adaptable Software Process Model, IEEE, 2001
[13] XILINX, On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, August 31, 2006
[14] Bas Breijer, Filipa Duarte, and Stephan Wong, An OCM Based Shared Memory Controller for Virtex 4, IEEE
[15] XILINX, PLB Usage in Xilinx FPGA, September
[16] XILINX, Embedded Systems Tools Reference Manual, January
[17] Ray C. C. Cheung, Dong-U Lee, Oskar Mencer, Automating Custom Precision Function Evaluation for Embedded Processors, CASES 05, September 24-27, 2005, San Francisco, California, USA
[18] Anish Muttreja, Anand Raghunathan, Automated Energy/Performance Macromodeling of Embedded Software, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, March 2007

Correction of Systematic Odometry Errors and Trajectory Tracking on a Tricycle-Type Industrial Mobile Robot

Meftah GHRISSI, Slim Ben SAOUD, Nabil LITAYEM
LECAP, Institut National des Sciences Appliquées et de Technologies, Ecole Polytechnique de Tunisie
SRD2I, Société de Recherche et Développement en Ingénierie Industrielle

Keywords: mobile robotics, control, embedded systems, odometry, trajectory tracking.

Abstract: For its industrial needs, the company SRD2I developed a tricycle-type robot operating in wire-guided mode. As requirements evolved, we were led to make it autonomous. In this paper we present the work carried out on this mode of operation and the experimental results obtained, namely the problems related to the positioning of the robot and to trajectory tracking. For positioning, we set out the practical problems involved and the solutions adopted. For trajectory tracking, we present the algorithms developed and implemented on the HERCULE robot, supported by experimental results.

1. Introduction

The use of AGVs (Automated Guided Vehicles) is today of great interest in many fields, from industrial applications to domestic use for elderly and disabled people. Their use improves factory productivity and makes certain tasks easier for people. An AGV essentially consists of the vehicle itself, the embedded control system, the guidance system and the remote supervisor with its user interface. In this paper we are particularly interested in the control of this robot, so as to manage its movements and carry out the tasks assigned to it.

Two essential problems must be handled: positioning the robot in its environment and tracking a trajectory defined by the user.

Positioning is a classic problem in mobile robotics. A mobile robot that has to navigate in its environment must be able to know its position in that environment. Two types of positioning are distinguished: relative and absolute. Relative positioning, or odometry, is computed from data provided by the robot's internal sensors (incremental encoders mounted on the wheels). The position and orientation of the robot are computed in an arbitrary frame (its physical position at power-up). This relative positioning is only valid over short distances because of its imprecision. Absolute positioning (position and orientation of the vehicle in a frame attached to the environment) is needed to reset the odometric position at well-defined distance intervals. This mode of positioning relies on very expensive equipment (time-of-flight measurement to fixed beacons followed by triangulation) [1]-[4].

The motion problem of mobile robots is often treated in two steps. The first consists of planning a path that the robot can execute, i.e. one that satisfies the kinematic constraints, such as rolling without slipping or curvature constraints. The second consists of designing closed-loop control laws on the current state of the robot that make it follow the planned path or execute a reference trajectory. This two-step approach differs from the purely reactive approach, in which the robot uses a perception system that provides information about the environment, from which it defines its immediate motion dynamically [5]-[18].

In this paper we limit ourselves to presenting the work carried out on these two topics, which was implemented and validated on an industrial tricycle platform named HERCULE, developed by the company SRD2I.
The objective is to move this vehicle from wire-guided operation to autonomous wireless operation. Finally, we outline the work that remains to be done to make this vehicle fully autonomous.

2- Presentation of the HERCULE vehicle:

Our platform is a tricycle with one driving and steering wheel at the front and two free wheels at the rear. The front wheel is servo-controlled in traction and in steering by DC motors coupled to mechanical gearboxes. The robot is powered by 36 VDC, 390 Ah lead-acid batteries, which give about 10 h of autonomy. The two rear incremental encoders are used to compute the odometry. Control is provided by an electronic board based on a 32-bit RISC microcontroller of type Motorola MPC555. This board handles the servo loops of the driving/steering wheel, motion generation, input/output management and the user interface. A magnetic sensor mounted on the front wheel is used for wire guidance of the vehicle. A laptop PC is connected to this board through an RS232 interface to acquire and interpret all the data stored on the control board.

[Fig 2.1: The HERCULE vehicle. Labelled parts: front wheel, traction motor, steering motor, left encoder, right encoder, gyroscope, left free wheel, right free wheel.]

3. Problems related to odometry

3.1- Odometry equations:

The odometry (position and orientation of the vehicle in its local frame) is computed from the information of the two encoders mounted on the two rear wheels. At each sampling step, the encoder pulses of each wheel are converted into a linear displacement with the following factor:

$C_m = \dfrac{\pi D_n}{n\, C_e}$ (3.1)

where:
$C_m$: conversion factor from encoder pulses to linear displacement
$D_n$: wheel diameter (mm)
$C_e$: encoder resolution (pulses/revolution)
$n$: mechanical reduction ratio

The incremental linear distance travelled by each wheel at sampling step $i$ is given by:

$\Delta U_{\text{right/left},\,i} = C_m\, N_{\text{right/left},\,i}$ (3.2)

where $N$ is the number of encoder pulses counted during the step.

The incremental linear displacement of the centre point of the rear axle is:

$\Delta U_i = \dfrac{\Delta U_{\text{right},\,i} + \Delta U_{\text{left},\,i}}{2}$ (3.3)

The orientation increment is obtained from:

$\Delta\theta_i = \dfrac{\Delta U_{\text{right},\,i} - \Delta U_{\text{left},\,i}}{e}$ (3.4)

where $e$ is the wheel spacing (distance between the two rear wheels).

The position and orientation of the vehicle are then obtained by integrating these increments over time.

Odometry errors:

Two types of error are distinguished [3].

a- Systematic errors are those caused by uncertainties in the data used to compute the position and orientation.
According to relations (3.1) and (3.4), uncertainties in the wheel diameters and in the wheel spacing e cause a severe accumulation of error in the position and orientation of the vehicle.

b- Non-systematic errors are those caused by the interaction of the wheels with the ground, namely:
- uneven ground
- rolling over objects
- wheel slippage (skidding, momentary mechanical blocking of a wheel)

On a perfect floor these errors may matter less than the systematic errors. In industrial environments, however, a solution to this type of error is required.

We drew on the experimental UMBmark method (University of Michigan Benchmark) [2] to correct the systematic errors. The method consists in driving the vehicle along a square path of 4 m side, 5 times in one direction and 5 times in the opposite direction, and measuring each time the absolute arrival position relative to a reference wall. These positions are then compared with the position given by the odometry. Simple geometric relations then yield corrected values of the wheel diameters and of the wheel spacing e. The results of this experiment on the HERCULE robot are given in Table 3.1.

[Table 3.1: Experimental results. Columns: before correction, after correction. Rows: wheel spacing e [mm], right wheel diameter [mm], left wheel diameter [mm].]

4. Trajectory tracking

For trajectory tracking, we developed and then implemented two trajectory generation algorithms on the HERCULE vehicle [4]. The first algorithm is based on linearizing state feedback and the second on a Lyapunov approach. The experimental and validation results of these two algorithms are given below.

Experimental results

To validate the first algorithm we applied a trapezoidal velocity law consisting of a start-up phase over 0.8 m, a constant-velocity phase (0.6 m/s) and a deceleration phase over the last 0.8 m before the final point to prepare the stop. For the second algorithm we imposed a trapezoidal velocity variation on each segment of the trajectory. Given the strong commands issued by this algorithm, a turn cannot be negotiated at high speed. Figure 4.1 shows the principle of this velocity variation.

[Figure 4.1: Velocity variation for the second algorithm. Vertical axis: V (m/s); waypoints P0, P1, ..., PN-1, PN.]

We considered two types of trajectory to validate this algorithm:

a. Test 1: tracking of a trapezoidal trajectory. With this test we validated our algorithm on a trapezoidal trajectory made of 6 segments with different angles.
b. Test 2: tracking of a Z-shaped trajectory.

Figure 14.1 shows the results of these experiments. Row a shows the results of the first algorithm, and rows b and c those of the second algorithm. In each row, column 1 shows the trajectory of the robot and column 2 that of the centre of the rear wheels.

On curves 14.1 a, we confirm the presence of the safety distance computed dynamically by the control algorithm. As soon as the robot exceeds the safety distance, the algorithm generates non-zero steering commands until the robot catches up with the new straight line. This is how the algorithm negotiates turns. On curve 14.1 b we observe almost perfect path following with the algorithm based on the Lyapunov approach. The test on the Z-shaped trajectory highlights the considerable advantage of the Lyapunov-based algorithm over the one based on linearization: with this algorithm there is no longer any limitation on the turning angle.

5. Conclusion

In the first part of this paper we presented the HERCULE robot, its mechanical structure, its sensors and its control part. We then presented a solution that eliminates the systematic odometry errors, together with the experimental results. The problem of non-systematic errors remains to be solved and is planned as future work. As regards trajectory tracking, the two algorithms developed give satisfactory results. The choice between them will depend on the constraints of the environment in which the robot operates. For example, if obstacles must be avoided on the robot's route, the second algorithm will be chosen, since it sticks most closely to the defined trajectory. Our future work will be divided into two parts:
1- correction of the non-systematic odometry errors and absolute positioning of the robot in its environment;
2- development of a supervision and management system, an essential component for integrating this robot in an industrial setting.
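As an illustration of the odometry relations (3.1) to (3.4) of Section 3.1, the update performed at each sampling step can be sketched as follows in Python. This is a minimal sketch and not the code running on the MPC555 board: the function names, the use of one conversion factor per wheel, and the simple Euler integration of the pose are assumptions made for the example.

```python
import math


def conversion_factor(wheel_diameter_mm, encoder_resolution, gear_ratio):
    """Relation (3.1): linear displacement (mm) per encoder pulse."""
    return math.pi * wheel_diameter_mm / (gear_ratio * encoder_resolution)


def odometry_step(pose, pulses_right, pulses_left, cm_right, cm_left, wheel_spacing_mm):
    """Update the pose (x, y, theta) from the pulses counted during one sampling step.

    The pose integration (heading first, then position) is an assumed scheme;
    the paper only states that the increments are integrated over time.
    """
    x, y, theta = pose
    du_right = cm_right * pulses_right                # relation (3.2), right wheel
    du_left = cm_left * pulses_left                   # relation (3.2), left wheel
    du = (du_right + du_left) / 2.0                   # relation (3.3)
    dtheta = (du_right - du_left) / wheel_spacing_mm  # relation (3.4)
    theta += dtheta
    x += du * math.cos(theta)
    y += du * math.sin(theta)
    return (x, y, theta)
```

With this form, the UMBmark correction of Table 3.1 would simply amount to recomputing `cm_right`, `cm_left` and `wheel_spacing_mm` from the corrected diameters and wheel spacing.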

[Figure 14.1: Trajectories of the robot and of the centre of the rear wheels obtained with the two algorithms. Panels (a), (b), (c); markers: initial position, final position.]
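The trapezoidal velocity law used to validate the first algorithm (start-up over 0.8 m, cruise at 0.6 m/s, deceleration over the last 0.8 m) can be sketched as a speed set-point that depends on the distance already travelled. The function name and the constant-acceleration ramps are assumptions; the paper does not state how the ramps are generated.

```python
import math


def trapezoidal_speed(s, path_length, ramp_length=0.8, cruise_speed=0.6):
    """Speed set-point (m/s) at distance s (m) along a path of given length (m).

    Constant acceleration is assumed on both ramps, so the speed follows
    v = sqrt(2 * a * d), where d is the distance from the nearest path end.
    """
    if path_length < 2.0 * ramp_length:
        raise ValueError("path too short for two full ramps")
    s = min(max(s, 0.0), path_length)
    accel = cruise_speed ** 2 / (2.0 * ramp_length)
    if s < ramp_length:                      # start-up phase
        return math.sqrt(2.0 * accel * s)
    if s > path_length - ramp_length:        # deceleration phase
        return math.sqrt(2.0 * accel * (path_length - s))
    return cruise_speed                      # constant-velocity phase
```

Because the set-point is exactly zero at `s = 0`, a controller driven only by distance would never start; a real implementation would presumably use a time-based ramp or a small minimum speed. For the second algorithm, the same function could be applied to each segment between two waypoints.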

References

[1] S. Ben Saoud, L. Nciri, and M. Ghrissi, "Path-tracking and parking manoeuvre control of an industrial tricycle robot", International Journal of Robotics and Automation, Vol. 20, No. 4, 2005.
[2] J. Borenstein, "Experimental Results from Internal Odometry Error Correction With the OmniMate Mobile Platform", IEEE Transactions on Robotics and Automation, Vol. 14, No. 6, December 1998.
[3] J. Borenstein and L. Feng, "UMBmark: A Benchmark Test for Measuring Odometry Errors in Mobile Robots", Proceedings of the 1995 SPIE Conference on Mobile Robots, Philadelphia, October 22-26, 1995.
[4] P. Pomiers, M. Grissi and A. Semerano, "Outdoor navigation strategy in hazardous environments", Proceedings of CLAWAR 99: 2nd International Conference on Climbing and Walking Robots.
[5] A. Tayebi, M. Tadjine and A. Rachid, "Quasi-continuous output feedback control for nonholonomic systems in chained form", Laboratoire des Systèmes Automatiques, Université de Picardie-Jules Verne, Amiens, France.
[6] A. De Luca, G. Oriolo and C. Samson, "Feedback Control of a Nonholonomic Car-Like Robot", Lecture Notes in Control and Information Sciences 229, Springer, 1998, 343 p.
[7] Alexis Scheuer, "Planification de chemins à courbure continue pour robot mobile non-holonome", Thesis, Institut national polytechnique de Grenoble.
[8] André Kamga and Ahmed Rachid, "A simple path tracking controller for car-like mobile robots", Laboratoire des Systèmes Automatiques, Amiens, France.
[9] Bryan Nagy and Alonzo Kelly, "Trajectory generation for car-like robots using cubic curvature polynomials", Robotics Institute, Carnegie Mellon University, Pittsburgh, PA; Helsinki, Finland, June 11, 2001.
[10] J.P. Laumond, S. Sekhavat and F. Lamiraux, "Guidelines in Nonholonomic Motion Planning for Mobile Robots", Lecture Notes in Control and Information Sciences 229, Springer, 1998, 343 p.
[11] José Castro, Vitor Santos, M. Isabel Ribeiro, "A Multi-Loop Robust Navigation Architecture for Mobile Robots", IEEE International Conference on Robotics and Automation, ICRA 98, Leuven, Belgium, May.
[12] Kane Usher, Peter Ridley and Peter Corke, "Visual Servoing of a Car-Like Vehicle: An Application of Omnidirectional Vision", Proc. 2002 Australasian Conference on Robotics and Automation, Auckland, November.
[13] L. García-Pérez, M.C. García-Alegre, A. Ribeiro, D. Guinea, "Fuzzy control for an approaching-orienting maneuver with a car-like vehicle in outdoor environments", Instituto Automática Industrial (IAI), Consejo Superior de Investigaciones Científicas (CSIC), Madrid.
[14] M. Egerstedt, X. Hu, and A. Stotsky, "Control of Mobile Platforms Using a Virtual Vehicle Approach", IEEE Transactions on Automatic Control, Vol. 46, No. 11, November 2001.
[15] Maher Khatib, "Contrôle du mouvement d'un robot mobile par retour sensoriel", LAAS-CNRS, Toulouse, France, 1996.
[16] Philippe Garnier and Thierry Fraichard, "A Fuzzy Motion Controller for a Car-Like Vehicle", Research report, INRIA, No. 3200, June 1997.
[17] Roger Pissard-Gibollet, Patrick Rives, "Asservissement visuel appliqué à un robot mobile : Etat de l'art et modélisation cinématique", INRIA Sophia Antipolis research unit, No. 1577, December 1991.
[18] M. Sampei et al., "Arbitrary path tracking control of articulated vehicles using nonlinear control theory", IEEE Transactions on Control Systems Technology, Vol. 3, 1995.

Comparative Study of Inter-Processor Communication Means in MPSoC Architectures

Nabil LITAYEM, Meftah GHRISSI, Bilel FITOURI, Slim BEN SAOUD
Institut National des Sciences Appliquées et de Technologies, INSAT, Centre Urbain Nord, BP Tunis Cedex
LECAP - EPT

Abstract. Embedded processing requirements are growing at an exponential rate, which favours the design of multiprocessor systems to overcome the complexity problems encountered in single-processor systems. FPGA platforms are one possible solution, given the availability of several soft-core and hard-core processors that can be connected in homogeneous or heterogeneous multiprocessor architectures using several types of interconnect. This article evaluates inter-processor communication in a set of parallel architectures whose inter-processor communication is based on shared memory. All of these studies are carried out on a XILINX ML310 platform, which includes a Virtex-II Pro FPGA.

Keywords: Embedded systems, Soft-Core processor, Hard-Core processor, FPGA, MicroBlaze, PowerPC, Benchmark, MPSoC.

1 Introduction

An embedded system is an information-processing system built into a larger product. Unlike a general-purpose system, it repeatedly executes the same program and obeys a set of severe constraints such as cost, power consumption, size and performance. In most cases it is also a reactive system that must perform its computations in real time. The performance of embedded systems is one of the critical aspects that must be taken into account during the design phase. Software solutions such as audio/video applications, encoding/decoding, image processing [1] or distributed applications demand both precision and short execution time [2], which gave rise to multiprocessor systems, or MPSoCs. Several approaches have been proposed for evaluating the performance of MPSoC architectures [3][4]. Our approach is to propose a set of parallel architectures (homogeneous and heterogeneous) and then measure their performance from the communication point of view.

This article is organized as follows. Section 2 presents the hardware platform and the processors studied. Section 3 presents the parallel architectures used and evaluates their performance. A discussion is given in Section 4. Finally, a conclusion and some perspectives are given in Section 5.

2 Presentation of the hardware platform and of the processors used

2.1 Presentation of the hardware platform:

The hardware platform is the XILINX ML310 board shown in Figure 1. It is built around a Virtex-II Pro FPGA, which can include one or two IBM PowerPC processors clocked at 400 MHz (depending on the device range) [5]. On the ML310, the Virtex-II Pro includes two PowerPC processors; the board is also equipped with a wide range of peripherals (PCI interface, Compact Flash, DDR memory, IDE interfaces, Ethernet interface, etc.).

[Fig 1. XILINX Virtex-II Pro FPGA platform]

2.2 PowerPC 405 hard-core processor:

The PowerPC 405 processor, shown in Figure 2, is a 32-bit RISC processor from IBM. It is mainly intended for embedded applications [6]. In Virtex-II Pro FPGA platforms this processor is implemented directly in silicon, but it can be used in combination with other hardware modules designed and synthesized on the FPGA platform.

[Fig 2. Architecture of the PowerPC processor]

2.3 MicroBlaze soft-core processor

The MicroBlaze soft-core processor, shown in Figure 3, is a RISC processor optimized for implementation on XILINX FPGAs. It is highly configurable. Configuring it is simple, since this is done in the EDK development environment, which lets the user set its parameters when the logic design is created. It occupies between 900 and 2600 logic cells (LUTs, Look-Up Tables) and reaches a frequency of 80 to 180 MHz depending on the platform (Spartan, Virtex...) and on the selected options [7]. Its ease of configuration and small silicon area make this processor a very attractive choice compared with other soft-cores (LEON 3 [8], OpenRISC [9]), particularly on XILINX platforms.

[Fig 3. Architecture of the MicroBlaze processor. Blocks: instruction-side bus interface (IXCL_M, IXCL_S, IOPB, ILMB), data-side bus interface (DXCL_M, DXCL_S, DOPB, DLMB), I-cache, D-cache, bus interfaces, program counter, instruction buffer, instruction decode, special purpose registers, register file (32 x 32b), ALU, shift, barrel shift, multiplier, divider, MFSL 0..7, SFSL 0..7; some blocks are optional parameters.]

3 Design and evaluation

3.1 MPSoC with the FSL bus

The MicroBlaze soft-core processor has 8 FSL (Fast Simplex Link) [10] input/output links. The FSL bus is a fast means of communication between the processor and other entities. A single interface is enough to link two processors. Communication over the FSL links is done simply with predefined instructions.
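The behaviour of such a simplex link can be pictured with a small host-side model: a bounded, one-way FIFO that the master writes and the slave reads. The sketch below is an illustration in Python only, not the MicroBlaze FSL instructions or the XILINX driver; the class and method names, the 32-bit word masking and the depth of 16 words are assumptions chosen for the example.

```python
from collections import deque


class SimplexFifo:
    """One-way bounded FIFO standing in for a single FSL channel (hypothetical model)."""

    def __init__(self, depth=16):
        self._words = deque()
        self._depth = depth

    def try_write(self, word):
        """Non-blocking write: returns False when the FIFO is full."""
        if len(self._words) >= self._depth:
            return False
        self._words.append(word & 0xFFFFFFFF)  # keep 32-bit words (assumed width)
        return True

    def try_read(self):
        """Non-blocking read: returns None when the FIFO is empty."""
        if not self._words:
            return None
        return self._words.popleft()


def transfer(words, fifo):
    """Interleave a master that fills the FIFO and a slave that drains one word per turn."""
    pending = deque(words)
    received = []
    while len(received) < len(words):
        # Master side: push words until the FIFO refuses one.
        while pending and fifo.try_write(pending[0]):
            pending.popleft()
        # Slave side: consume one word if available.
        word = fifo.try_read()
        if word is not None:
            received.append(word)
    return received
```

The model shows why a transfer larger than the FIFO depth only completes if the reader drains the link while the writer is still sending, which is the constraint a blocking or non-blocking access scheme has to respect.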

[Fig 4. FSL bus. Two MicroBlaze processors (MB_0, MB_1), each with its BRAM on the ILMB and DLMB buses, connected by an FSL link; OPB bus with RS232 and Timer.]

3.2 MPSoC with the shared bus

The shared bus is one of the most classical means of interconnection and is inherited from single-processor architectures; it can also be shared by two or more processors (up to 16 for the OPB bus [11]). A shared memory must be connected to the shared bus to store the data common to the two processors.

[Fig 5. Shared bus. Two MicroBlaze processors (MB_0, MB_1), each with its BRAM on the ILMB and DLMB buses; OPB bus with RS232, OPB_TIM and OPB_BRAM.]

3.3 MPSoC with dual-port memory

Since all XILINX BRAM blocks are dual-port memories, using them within the EDK environment is straightforward. A memory block must be attached to a controller before it can be connected to a bus. Several types of controller exist (OCM, PLB, OPB, LMB...) [12], which allows several possible implementations of architectures using dual-port memories.
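The communication pattern common to Sections 3.2 and 3.3, a master that writes a block of data into a shared memory and a slave that reads it back, can be sketched on a host machine with two threads. This is a simulation in Python under stated assumptions, not the program run on the ML310: the `SharedMemory` class, the `ready` flag used for synchronization and the function names are hypothetical.

```python
import threading


class SharedMemory:
    """Word-addressable buffer standing in for the shared BRAM block (hypothetical)."""

    def __init__(self, size_words):
        self.words = [0] * size_words
        self.ready = threading.Event()  # assumed "data valid" flag set by the master


def master(mem, data):
    """Master processor: write the data items, then signal that they are valid."""
    for index, word in enumerate(data):
        mem.words[index] = word
    mem.ready.set()


def slave(mem, count, out):
    """Slave processor: wait for the flag, then read the data items."""
    mem.ready.wait()
    out.extend(mem.words[:count])


def run_transfer(data):
    """Run one master-to-slave transfer and return what the slave received."""
    mem = SharedMemory(len(data))
    out = []
    t_slave = threading.Thread(target=slave, args=(mem, len(data), out))
    t_master = threading.Thread(target=master, args=(mem, data))
    t_slave.start()
    t_master.start()
    t_master.join()
    t_slave.join()
    return out
```

For example, `run_transfer(list(range(100)))` models the transfer of 100 data items. On the real hardware the two sides would be separate programs on two processors, and the access time would depend on which controller (OPB, PLB, LMB or OCM) each side uses to reach the memory.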

[Fig 6. Dual-port memory. Two MicroBlaze processors (MB_0, MB_1), each with its BRAM on the ILMB and DLMB buses, sharing a third BRAM; two OPB buses with RS232 and OPB_TIMER.]

3.4 Evaluation of the MPSoCs

Besides the hardware configuration, several other factors can have a significant effect on MPSoC performance. Once the system is designed, the EDK development environment lets the user modify many software parameters according to need. These factors include, among others, the chosen compiler, the compilation optimization level, and the bus and processor frequencies supported on the target board. Table 1 summarizes the software parameters fixed for all the MPSoCs designed above.

Table 1. Software parameters for MicroBlaze and PowerPC

Parameter            MicroBlaze   PowerPC
Clock                100 MHz      100 MHz
Bus                  100 MHz      100 MHz
Compiler             mb-gcc       powerpc-eabi-gcc
Optimization level   Level 2      Level 2

One and the same program was used to evaluate the MPSoCs created (with a small change for the architecture that uses the FSL bus). The idea is to measure the time needed to transfer a variable number of data items from the master processor to the slave processor. The execution time of the algorithm was measured mainly with the OPB_TIMER peripheral, in clock cycles (at the 100 MHz clock of Table 1, a count of 10^8 cycles thus corresponds to one second of execution). Table 3 lists the dual-processor architectures implemented with the interconnect types described above, together with the execution time of the transfer algorithm for 100, 200, 500 and 1000 data items. For the dual-port memory, four versions were designed according to the memories available on the ML310 board.

Remarks

a- The implemented architectures follow an analogy, since the processors are different (soft-core and hard-core) and each of them therefore uses its own buses and controllers. Table 2 describes the analogy between the PowerPC and MicroBlaze processors.

Table 2. PowerPC-MicroBlaze analogy

Peripheral                   MicroBlaze   PowerPC
Internal memory controller   LMB          OCM
External memory controller   OPB          PLB
Bus                          OPB          PLB

The memory controller is an interface that connects a BRAM memory block to a processor bus. In the rest of this article we use the term "memory" to designate a memory controller. Example: an LMB memory means a BRAM memory block attached to an LMB memory controller.

b- Two measurements could not be obtained for the dual-PowerPC architecture in the last version, because of the limited size that the DOCM (Data On-Chip Memory) bus can address with a single memory block.

Table 3. Results of the program execution

Interconnect           Master   Slave   MM    ME    MP
FSL                    MB       MB      LMB   LMB   FIFO (FSL)
Shared bus             MB       MB      LMB   LMB   OPB
Shared bus             PPC      PPC     OCM   OCM   PLB
Dual-port, version 1   MB       MB      OPB   OPB   dual OPB
Dual-port, version 1   PPC      PPC     PLB   PLB   dual PLB
Dual-port, version 2   MB       MB      OPB   OPB   dual LMB
Dual-port, version 2   PPC      PPC     PLB   PLB   dual OCM
Dual-port, version 2   MB       PPC     OPB   PLB   dual OCM/LMB
Dual-port, version 2   PPC      MB      PLB   OPB   dual OCM/LMB
Dual-port, version 3   MB       MB      LMB   OPB   dual LMB
Dual-port, version 3   PPC      PPC     OCM   PLB   dual OCM
Dual-port, version 3   MB       PPC     LMB   PLB   dual OCM/LMB
Dual-port, version 3   PPC      MB      OCM   OPB   dual OCM/LMB
Dual-port, version 4   MB       MB      LMB   LMB   dual LMB
Dual-port, version 4   PPC      PPC     OCM   OCM   dual OCM
Dual-port, version 4   MB       PPC     LMB   OCM   dual OCM/LMB
Dual-port, version 4   PPC      MB      OCM   LMB   dual OCM/LMB

The execution-time columns (100, 200, 500 and 1000 data items) are not reproduced here; within version 4, two entries are marked X (not measured, see remark b).

MM: master memory; ME: slave memory; MP: shared memory.

MB: MicroBlaze; PPC: PowerPC.

4 Discussion:

From the values given in Table 3, we deduce that:

a) the values in a given row are proportional, which shows that the measurements do correspond to the data transfer time;

b) the values range from 1000 clock cycles upward, which corresponds to times between 10 and 750 microseconds (at a frequency of 100 MHz one cycle lasts 10 ns, so 1000 cycles take 10 µs). This range is consistent with the performance of the two processors, PowerPC and MicroBlaze, and with the latencies of the memories used;

c) version 4, which uses dual-port LMB and OCM memories for the MicroBlaze and PowerPC processors respectively, gives the best results, since the memories used in this system are low-latency memories, unlike those connected to the OPB or PLB bus, which are slower;

d) the shared bus, the classical method for implementing a multiprocessor architecture, appears to perform fairly well according to the values in Table 3. This does not change the fact that it is the least advisable technique for architectures with more than two processors, because of the bottleneck it can create on the shared bus [4];

e) the values obtained let us rank the architectures (in most cases) in order of performance as follows: dual-PowerPC architecture; PowerPC-MicroBlaze architecture (called heterogeneous); dual-MicroBlaze architecture. The MicroBlaze soft-core cannot, in most cases, match the performance of a hard-core such as the PowerPC. The gain obtained when a MicroBlaze is replaced by a PowerPC in a heterogeneous architecture is clear evidence of this;

f) for the heterogeneous architectures of versions 2 and 4, we simply swapped the roles of the processors by swapping the programs assigned to each of them; no other hardware architecture has to be built (following the analogy of Table 2, unlike version 3). The results obtained are not the same because the master and slave programs differ, which means that the number of memory accesses is not the same;

g) for version 1 there is no heterogeneous architecture: following the analogy of Table 2, the dual-port memory would involve an OPB controller and a PLB controller. This configuration is impossible because the PLB connector is 64 bits wide, whereas the port of the BRAM memory block is 32 bits wide;

h) the FSL bus is visibly the best means of communication between two MicroBlaze processors. However, this solution is not flexible, because the FIFO memory is small, so data transfers must use the blocking and non-blocking methods provided by the FSL bus driver with care.

5 Conclusion and perspectives

This article proposes solutions for implementing multiprocessor systems on a Virtex-II Pro FPGA platform and for measuring their performance. Different types of interconnect were used for this purpose, namely the shared OPB and PLB buses and the FSL interfaces provided with the MicroBlaze processor. XILINX BRAM blocks are dual-port memories by default, which also made it possible to design parallel architectures by integrating such blocks. All these interconnects use communication based on shared memory.

In a second step, in order to evaluate the performance of the systems designed, we introduced a parallel algorithm and measured its execution time on the MPSoCs created. The most important factors on which one can act to improve the performance of a parallel architecture implemented on an FPGA such as the Virtex-II Pro are mainly the choice of processor (soft-core or hard-core), the memories, and the means of communication.

This work can be generalized to a larger number of MicroBlaze soft-core processors (since the device integrates only two PowerPC hard-core processors), and benchmarks may prove useful to evaluate adequately the performance of our multiprocessor systems for specific application domains.
Our study can also be completed by evaluating the software layer, in particular the choice of RTOS [13] and its impact on the overall performance of the system. Finally, energy consumption must be taken into account, because the means of communication between the processors has a significant effect on the consumption of the system [14].

References

1. Randall S. Janka, Linda M. Wills, Lewis B. Baumstark, Jr., Virtual Benchmarking and Model Continuity in Prototyping Embedded Multiprocessor Signal Processing Systems, IEEE Transactions on Software Engineering, Vol. 28, No. 9, September.
2. Guilin Chen, Guangyu Chen, Ozcan Ozturk, and Mahmut Kandemir, Exploiting Inter-Processor Data Sharing for Improving Behavior of Multi-Processor SoCs, Proceedings of the IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design.
3. Kai Richter, Marek Jersak, Rolf Ernst, A Formal Approach to MpSoC Performance Verification, IEEE Computer Society, 2003.
4. Yeliang Zhang, Vinod Tipparaju, Jarek Nieplocha, Salim Hariri, Parallelization of the NAS Conjugate Gradient Benchmark Using the Global Arrays Shared Memory Programming Model, Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium.
5. XILINX, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, November 5.
6. XILINX, PowerPC 405 Processor Block Reference Guide, Embedded Development Kit, June 5.
7. XILINX, MicroBlaze Processor Reference Guide, Embedded Development Kit.
10. XILINX, Fast Simplex Link (FSL) Bus (v2.10a), DS449, November 2.
11. XILINX, On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, August 31.
12. Bas Breijer, Filipa Duarte, and Stephan Wong, An OCM based shared memory controller for VIRTEX 4, IEEE.
13. Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada, RTOS and Codesign Toolkit for Multiprocessor Systems-on-Chip, IEEE.
14. Mirko Loghi, Massimo Poncino, Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE 05).

Study of the Influence of Processor-Memory Communication on the Performance of a SoC Based on a XILINX FPGA

Nabil LITAYEM, Meftah GHRISSI, Bilel FITOURI, Slim BEN SAOUD
Institut National des Sciences Appliquées et de Technologies, INSAT, Centre Urbain Nord, BP Tunis Cedex
LECAP - EPT

Keywords: Soft-Core processor, Hard-Core processor, Processor-memory communication, FPGA, Embedded systems, Benchmark.

Abstract: Embedded processing requirements are growing at an exponential rate, and the offer of embedded processors is becoming ever wider given the multitude of choices available to designers; the selection criteria are also very varied (performance, cost, power consumption, associated tools, etc.). For FPGA platforms, two categories of embedded processors are currently available: soft-core processors, supplied as HDL source code or as a netlist, and hard-core processors, implemented in silicon on several FPGA platforms. Each processor comes with a set of means for interconnecting it with its memory. This article studies the influence of processor-memory communication on the performance of a SoC, using two embedded processors, MicroBlaze and PowerPC, on a XILINX Virtex-II Pro FPGA platform.

I. INTRODUCTION

The performance of embedded systems is one of the critical aspects that must be taken into account during the design phase [1], [2]. Software solutions such as audio/video applications, encoding/decoding, image processing or distributed applications demand both precision and short execution time. Increasing integration density has made it possible to integrate complete systems, called SoCs, on a single chip [3].
Processor-memory communication, as in any computer system, is a key point that affects the overall performance of a SoC. This article presents a comparative study of the means of processor-memory communication for the two processors MicroBlaze [4] and PowerPC 405 [5], in order to reveal the influence of these means of communication on the overall performance of a SoC built on a XILINX FPGA platform.

This article is organized as follows. Section 2 presents the hardware platform and the processors studied. Section 3 presents the results obtained with the two benchmarks run on the architectures studied. A discussion is given in Section 4. Finally, a conclusion and some perspectives are given in Section 5.

II. PRESENTATION OF THE HARDWARE PLATFORM AND OF THE PROCESSORS USED

A. Presentation of the hardware platform:

The hardware platform is the XILINX ML310 board shown in Figure 1. It is built around a Virtex-II Pro FPGA [6], which can include one or two IBM PowerPC processors clocked at 400 MHz (depending on the device range) [5]. On the ML310, the Virtex-II Pro includes two PowerPC processors; the board is also equipped with a wide range of peripherals (PCI interface, Compact Flash, DDR memory, IDE interfaces, Ethernet interface, etc.).

[Figure 1. XILINX Virtex-II Pro FPGA platform]

B. PowerPC 405 hard-core processor:

The PowerPC 405 processor, shown in Figure 2, is a 32-bit RISC processor from IBM. It is mainly intended for embedded applications. In Virtex-II Pro FPGA platforms this processor is implemented directly in silicon, but it can be used in combination with other hardware modules designed and synthesized on the FPGA platform.

[Figure 2. Architecture of the PowerPC processor]

C. MicroBlaze soft-core processor

The MicroBlaze soft-core [7] processor, shown in Figure 3, is a RISC processor optimized for implementation on XILINX FPGAs. It is highly configurable [8]. Configuring it is simple, since this is done in the EDK development environment, which lets the user set its parameters when the logic design is created. It occupies between 900 and 2600 logic cells (LUT¹) and reaches a frequency of 80 to 180 MHz depending on the platform (Spartan, Virtex...) and on the selected options.
Its ease of configuration and small silicon area make this processor a very attractive choice compared with other soft cores (LEON3 [9], OpenRISC [10]), particularly on XILINX platforms.

Figure 3. Architecture of the MicroBlaze processor (block diagram: instruction-side bus interface with IXCL, IOPB and ILMB; data-side bus interface with DXCL, DOPB and DLMB; MFSL 0..7 and SFSL 0..7 links; program counter, instruction buffer, instruction decode, special purpose registers, ALU, shift, barrel shift, and a 32 x 32-bit register file)

1. LUT: Look-Up Table

III. PERFORMANCE EVALUATION

One of the most effective ways of evaluating the performance [11] of a microprocessor system is benchmarking [12]. A benchmark is an application that reflects the performance of a microprocessor, and of the compiler used, in a well-defined domain. In what follows we use the two benchmarks Stanford and Dhrystone to reflect the performance of the architectures under study. For all of them the compiler used is GCC. The architectures studied are different processor-memory combinations using the various XILINX memory controllers [13], [14], [15], [16].

A. Performance evaluation with the Dhrystone benchmark

1. Presentation of Dhrystone

Dhrystone is a synthetic benchmark that reflects the integer computing performance of a microprocessor system. It is written in C, which makes it highly portable across architectures. Its very small memory footprint makes it unsuitable for current PC architectures, since it does not reflect the influence of a large memory. It remains well suited to embedded systems, however.

2. Results obtained with Dhrystone

To evaluate the performance of the different memory architectures for the PowerPC and MicroBlaze processors, we ran the benchmark on the architectures listed in Table 1. In the architectures that use a cache, the cache size is 16 KB for both processors. Note, however, that the MicroBlaze cache size is configurable and can reach 64 KB, whereas the PowerPC cache size is fixed at 16 KB.

Table 1. Results obtained with the Dhrystone benchmark

Processor    Memory   Cache   DMIPS
PowerPC      PLB      NO      22.9
PowerPC      PLB      YES     114
PowerPC      OCM      NO      84.4
MicroBlaze   LMB      NO      91.4
MicroBlaze   OPB      NO      15.7
MicroBlaze   SDRAM    NO      4.57
MicroBlaze   SDRAM    YES     (value missing)

3. Analysis of the results obtained

Running the Dhrystone benchmark leads to the following conclusions:

a) The performance reached by the PowerPC and the MicroBlaze is comparable when internal memories are used (OCM for the PowerPC and LMB for the MicroBlaze).

b) The best performance is obtained with the PowerPC architecture using PLB memory with 16 KB of cache. This architecture shows that a cache brings a large performance gain (from 22.9 to 114 DMIPS, about a fivefold increase).

c) Although the MicroBlaze is a soft core, it can reach high performance comparable to that of the PowerPC at the same operating frequency.

d) The performance loss caused by using external memory can be compensated by using a cache.

B. Performance evaluation with the STANFORD benchmark

1. Presentation of STANFORD

STANFORD is a small benchmark suite made up of the following programs:

Perm: a recursive permutation program.

Towers: a program solving the Towers of Hanoi problem.
Queens: a program solving the eight queens problem 50 times.
Intmm: a program multiplying two integer matrices.
Mm: a program multiplying two floating-point matrices.
Puzzle: a compute-bound program.
Quick: a program sorting an array with the Quicksort algorithm.
Bubble: a program sorting an array with the Bubblesort algorithm.
Tree: a program sorting an array with the Treesort algorithm.
FFT: a program computing the fast Fourier transform.

STANFORD measures the execution time in milliseconds for each of the small programs included in the benchmark (ten in the list above: eight integer programs and two floating-point programs). Two weighted sums are computed: the first reflects the execution time of the fixed-point programs and the second the execution time of the floating-point programs. The coefficients of the weighted sums are predefined experimentally.

2. Results obtained with Stanford

The same architectures as before, except the PowerPC architecture with an OCM controller, were evaluated with the Stanford benchmark. The OCM architecture was excluded because Stanford, being much larger than Dhrystone, cannot be loaded into a memory of this type. The following table shows the results obtained with this benchmark.

Table 2. Results obtained with the Stanford benchmark
Columns: Description (Processor, Memory, Cache); Functions (Perm, Towers, Queens, Intmm, Mm, Puzzle, Quick, Bubble, Tree, FFT); Cumulative (NFPC, FPC).
Rows: PowerPC PLB without cache; PowerPC PLB with cache; MicroBlaze LMB without cache; MicroBlaze OPB without cache; MicroBlaze SDRAM without cache; MicroBlaze SDRAM with cache.
(The numerical values of this table are missing from this transcription.)

3. Analysis of the results obtained

a) Comparing the PowerPC architecture with PLB memory and cache against the MicroBlaze architecture with LMB memory, we find comparable results for simple algorithms such as Perm and Towers. The gaps are large, however, for complex algorithms that include floating-point operations, such as Mm and FFT.

b) For the MicroBlaze architectures using external SDRAM, a cache improves performance and brings it close to the results obtained with LMB memory, particularly for algorithms using floating-point operations. This leads us to conclude that the performance limitation of these architectures is tied to the performance of the processor core and not to the communication with memory.

c) LMB memories, which have low latency, perform as well as the data and instruction caches used with the PowerPC. This result also appears in the Dhrystone runs and shows that a SoC integrating a MicroBlaze can sometimes outperform one that integrates a hard-core PowerPC.

d) External memory is not recommended for systems that must be high-performing, given the high number of cycles consumed by its read/write accesses.
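The memory and cache effects discussed for the Dhrystone figures of Table 1 can be expressed as simple ratios. The following is a minimal sketch in Python, not part of the original study: the dictionary layout and the helper name `speedup` are illustrative assumptions, and only the DMIPS values that survive in Table 1 are used (the MicroBlaze SDRAM result with cache is omitted because its value is missing).

```python
# DMIPS values copied from Table 1 (Dhrystone results).
# Keys are (processor, memory, cache_enabled); this layout is an assumption
# made for illustration only.
DMIPS = {
    ("PowerPC", "PLB", False): 22.9,
    ("PowerPC", "PLB", True): 114.0,
    ("PowerPC", "OCM", False): 84.4,
    ("MicroBlaze", "LMB", False): 91.4,
    ("MicroBlaze", "OPB", False): 15.7,
    ("MicroBlaze", "SDRAM", False): 4.57,
}


def speedup(reference, candidate):
    """Return how many times faster `candidate` is than `reference`."""
    return DMIPS[candidate] / DMIPS[reference]


# Effect of enabling the 16 KB cache on the PowerPC with PLB memory.
plb_cache_gain = speedup(("PowerPC", "PLB", False), ("PowerPC", "PLB", True))

# MicroBlaze on internal LMB memory compared with PowerPC on internal OCM.
lmb_vs_ocm = speedup(("PowerPC", "OCM", False), ("MicroBlaze", "LMB", False))

# Cost of uncached external SDRAM relative to internal LMB on the MicroBlaze.
lmb_vs_sdram = speedup(("MicroBlaze", "SDRAM", False), ("MicroBlaze", "LMB", False))

print(f"PLB cache gain:      x{plb_cache_gain:.2f}")
print(f"LMB vs OCM:          x{lmb_vs_ocm:.2f}")
print(f"LMB vs SDRAM (no $): x{lmb_vs_sdram:.1f}")
```

Under these values the cache multiplies PowerPC/PLB throughput by roughly five, the two internal-memory configurations stay within about ten percent of each other, and uncached SDRAM is about twenty times slower than LMB, which is consistent with conclusions a), b) and d) of the Dhrystone analysis.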

IV. CONCLUSION AND PERSPECTIVES

This work allowed us to evaluate closely the performance that a system on chip implemented on a Virtex-II Pro FPGA can reach. We found that, besides the choice of processor, memory also plays a major role in the performance a SoC can reach. We also found that the performance of a soft core is comparable to that of a hard core, particularly when the memory is chosen appropriately. This work should nevertheless be completed by a study of the energy consumption of the chosen systems [18]. The designer will then be able to choose the architecture judiciously, taking into account both the performance of the system and its consumption.

References

[1] Wayne Wolf, High-Performance Embedded Computing, Elsevier 2007, ISBN 13:
[2] Lizy Kurian John, Lieven Eeckhout, Performance Evaluation and Benchmarking, CRC Press 2006, ISBN
[3] Ahmed Amine Jerraya, Sungjoo Yoo, Diederik Verkest, Norbert Wehn, Embedded Software for SoC, Kluwer Academic Publishers 2003, ebook ISBN:
[4] XILINX, MicroBlaze Processor Reference Guide, Embedded Development Kit,
[5] XILINX, PowerPC 405 Processor Block Reference Guide, Embedded Development Kit, June 5,
[6] XILINX, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, November 5,
[7] Steven J. E. Wilton, Noha Kafafi, James C. H. Wu, Kimberly A. Bozman, Victor O. Aken'Ova, and Resve Saleh, Design Considerations for Soft Embedded Programmable Logic Cores
[8] Ludovic L'Hours, Generating Efficient Custom FPGA Soft-Cores for Control-Dominated Applications, Proceedings of the 16th International Conference on Application-Specific Systems, Architecture and Processors (IEEE ASAP 05).
[9]
[10]
[11] Vittorio Cortellessa, Pierluigi Pierini, and Daniele Rossi, Integrating Software Models and Platform Models for Performance Analysis, IEEE Transactions on Software Engineering, Vol. 33, No. 6, June 2007, pp
[12] Yingxu Wang and Hareton K. N. Leung, Benchmark-Based Adaptable Software Process Model, IEEE 2001, pp
[13] XILINX, On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, August 31, 2006
[14] Bas Breijer, Filipa Duarte, and Stephan Wong, An OCM Based Shared Memory Controller for Virtex 4, IEEE
[15] XILINX, PLB Usage in Xilinx FPGA, September
[16] XILINX, Embedded Systems Tools Reference Manual, January,
[17] Ray C. C. Cheung, Dong-U Lee, Oskar Mencer, Automating Custom Precision Function Evaluation for Embedded Processors, CASES 05, September 24-27, 2005, San Francisco, California, USA, pp
[18] Anish Muttreja, Anand Raghunathan, Automated Energy/Performance Macromodeling of Embedded Software, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 3, March 2007, pp

Embedded Microprocessor Systems Hardware Performance Evaluation and Benchmarking

Nabil LITAYEM, Meftah GHRISSI, Slim BEN SAOUD
Institut National des Sciences Appliquées et de Technologies (INSAT), Centre Urbain Nord BP Tunis Cedex
LECAP-EPT

Abstract

Since embedded systems are very specific platforms, measuring the hardware performance of such systems is an important task in any embedded system design process. In the classic case, hardware performance is a basic result reported by the hardware manufacturer. For designers of FPGA-based embedded systems, however, customizing the hardware configuration is a fundamental task, so designers must measure the hardware performance themselves. This paper focuses on the hardware performance analysis of FPGA-based embedded systems, using as examples two embedded systems based on the LEON3MMU processor and the eCos RTOS. The first is a mono-processor system and the second a bi-processor system.

I. INTRODUCTION

Human activity relies more and more on embedded systems, which are now present in many products such as PDAs, cameras and telephones. Designing such systems can follow several approaches depending on the platform used. In a classic approach, a General Purpose Processor (GPP), an Application Specific Processor (ASIP) or an Application Specific Integrated Circuit (ASIC) can be used as the heart of the embedded system. Each of these solutions has its advantages and weaknesses. We are currently witnessing the emergence of FPGA-based embedded systems. This kind of solution allows rapid embedded system generation [1], easy customization of the hardware configuration using predesigned hardware IPs [2], future evolution of the embedded system, and cost reduction. To design an embedded system on an FPGA-based platform, we need to choose an embedded processor and an embedded operating system. The embedded processor can be a hard core (built at the silicon level) or a soft core (delivered as a netlist or as HDL source).
A hard-core embedded processor has the advantage of computing performance but limits the system in terms of portability and scalability. A soft-core embedded processor offers less computing performance when the final platform is an FPGA, but is much better in terms of configurability, portability, customization and scalability [3] [4]. The embedded operating system can be time-sharing or real-time, proprietary or open source. The choice of an embedded operating system depends on the available memory, the background of the developers, the timing characteristics, etc. Since embedded software is usually developed under very limited hardware resources, the embedded system developer must have a clear idea of the hardware computing performance. Many current studies focus on the hardware performance evaluation of customized architectures [5]. In this paper we present an approach to measuring the hardware performance of FPGA-based embedded systems using freely available benchmarks.

II. OVERVIEW OF PERFORMANCE EVALUATION TOOLS AND TECHNIQUES

Evaluating performance in computer systems [6] will always be a real challenge for designers because of the constant evolution of such systems, especially in the embedded field, where architectures tend to become more and more complex [7]. The performance analysis of an embedded system has multiple aspects depending on the application the system is made for, and several design decisions are a direct result of performance evaluation.

The first evaluation method was based on the number of operations per second, but it quickly became necessary to build synthetic benchmarks able to report the true performance and weaknesses of a computer system. There are now many solutions for measuring hardware performance. Most of them are standard programs that execute various computing algorithms and report a number reflecting the performance of the architecture in a particular field. The most recognized are Dhrystone, which reports the performance of the architecture in Dhrystone MIPS; Stanford, which runs different algorithms and reports the performance of the architecture in every computation field covered by the benchmark; and Paranoia, which reports the characteristics of the floating-point unit. There are also commercial benchmarking solutions that are more efficient and more specialized, such as SPEC (Standard Performance Evaluation Corporation), which covers different computing fields, or EEMBC (Embedded Microprocessor Benchmark Consortium), designed especially for embedded systems.

In this paper we present a hardware performance analysis of mono-processor and bi-processor embedded systems using three benchmarks, each covering one aspect of the computing system. Our platform is based on two open source components (the LEON3 processor and the eCos RTOS), which makes us independent of any FPGA or RTOS vendor.

III. PRESENTATION OF THE USED PLATFORM

A. Overview of LEON3 microprocessor

LEON3 [8], presented in Figure 1, is a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture.
The model is highly configurable, and particularly suitable for system-on-a-chip (SOC) designs. The full source code is available under the GNU GPL (General Public License), allowing free and unlimited use for research and education. LEON3 is also available under a low-cost commercial license, allowing it to be used in any commercial application at a fraction of the cost of comparable IP cores. The LEON3 processor is distributed as part of the GRLIB IP library, allowing simple integration into complex SOC designs. GRLIB also includes a configurable LEON3 multi-processor design, with up to 16 CPUs attached to the AHB bus, and a large range of on-chip peripheral blocks.

Figure 1. Overview of LEON3 Architecture (block diagram: 7-stage integer pipeline, 3-port register file, instruction cache, data cache and AMBA AHB interface as the minimum configuration; MUL32, MAC16, DIV32, MMU, IEEE 754 floating-point unit, co-processor, interrupt control, debug interface and trace buffer as optional blocks)

B. Overview of eCos RTOS

eCos, an acronym for Embedded Configurable Operating System [9], is an open source, royalty-free, real-time operating system intended for embedded applications. The highly configurable nature of eCos allows the operating system to be customized to precise application requirements, delivering the best possible run-time performance and an optimized hardware resource footprint. A thriving net community has grown up around the operating system, ensuring ongoing technical innovation and wide platform support.

Figure 2. Overview of eCos Architecture (layer diagram: application; libraries and compatibility layers (Math, C, POSIX, µITRON); web server, networking stack and file system; kernel; device drivers (Ethernet, serial, flash); Hardware Abstraction Layer (interrupts, virtual vectors, exceptions); RedBoot ROM monitor and debug interface; target hardware)

The main components in the eCos architecture are the HAL (Hardware Abstraction Layer) and the eCos kernel.
The purpose of the eCos HAL is to make the application independent of the hardware target: the application manipulates the hardware layer through the HAL API. The HAL is also used by the upper OS layers, which makes porting eCos to a new hardware target a simple task consisting of developing the HAL for the new target.

The eCos kernel is the core of the eCos system. It includes most of the components of a modern operating system: scheduling, synchronization, interrupt and exception handling, counters, clocks, alarms, timers, etc. It is written in C++, allowing applications written in this language to interface directly with the kernel resources. The eCos kernel also supports interfacing to standard µITRON and POSIX compatibility layers.

C. Combination eCos LEON3

The choice of these two components makes us independent of any FPGA manufacturer or RTOS vendor. We can build our system on devices from XILINX, ALTERA, ACTEL or any other FPGA manufacturer, without paying royalties to an RTOS vendor. This is not the only possible choice: OpenRISC [10] could be used as the processor, and RTEMS [11] or embedded Linux [12] as the OS.

Performance is measured with the three benchmarks presented below. The hardware platform is simulated using tsim-leon3 for the mono-processor architecture and grsim-leon3 for the bi-processor platform in SMP (symmetric multiprocessing) configuration. These two simulation tools model the LEON3 architecture very closely and offer many other features that are important for system prototyping.

IV. BENCHMARKS USED

In our study we focus on three benchmarks.

A. Dhrystone

Dhrystone is a synthetic benchmark developed in 1984 by Reinhold P. Weicker, intended to be representative of integer system performance. Dhrystone grew to become representative of general processor (CPU) performance until it was superseded by the CPU89 benchmark suite from the Standard Performance Evaluation Corporation, today known as the "SPECint" suite.

B. Stanford

The Stanford Benchmark Suite is a small benchmark suite that was assembled by John Hennessy and Peter Nye around the time of the MIPS R3000 processors.
The benchmark suite contains ten applications: eight integer benchmarks and two floating-point benchmarks. The original suite measured the execution time in milliseconds for each benchmark in the suite. The Stanford Small Benchmark Suite includes the following programs:

- Perm: a tightly recursive permutation program.
- Towers: the canonical Towers of Hanoi problem.
- Queens: the eight queens chess problem solved 50 times.
- Integer MM: two 2-D integer matrices multiplied together.
- FP MM: two 2-D floating-point matrices multiplied together.
- Puzzle: a compute-bound program.
- Quicksort: an array sorted using the quicksort algorithm.
- Bubblesort: an array sorted using the bubblesort algorithm.
- Treesort: an array sorted using the treesort algorithm.
- FFT: a floating-point Fast Fourier Transform program.

This kind of benchmark is very useful for exploring the behavior of various architectures [11].

C. Paranoia

Designed by William Kahan, of the first IEEE 754 standardization team, Paranoia has as its essential purpose the characterization of the floating-point behavior of a computer system. Paranoia performs the following tests:

- Small integer operations.
- Search for radix and precision.
- Check whether rounding is done correctly.
- Check for the sticky bit.
- Test whether sqrt(X^2) = X for a number of integers X, whether the square root is monotonic, and whether it is correctly rounded or chopped.
- Test powers Z^i for small integers Z and i.
- Search for the underflow threshold and the smallest positive number.
- Test powers Z^Q at four nearly extreme values.
- Search for the overflow threshold and saturation.
- Try to compute 1/0 and 0/0.

V. PERFORMANCE MEASURES

After preparing the environment (configuring and building eCos for the LEON3MMU architecture, and testing some example applications running under eCos, such as a multithreaded application), we built the three benchmarks for our two platforms.
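To make the kind of test Paranoia runs (Section IV.C) more concrete, the radix and precision search and the square-root check can be sketched as follows. This is a minimal illustration in Python of the classic probing technique, not Paranoia's actual source code; the function names are hypothetical, and the expected values apply only to a host whose floats are IEEE 754 doubles.

```python
import math


def find_radix_and_precision():
    """Probe the floating-point radix and the number of significand digits."""
    # Grow w until it is so large that adding 1.0 is no longer exact.
    w = 1.0
    while (w + 1.0) - w == 1.0:
        w *= 2.0

    # The smallest increment that changes w reveals the radix.
    y = 1.0
    while (w + y) - w == 0.0:
        y += 1.0
    radix = int((w + y) - w)

    # Count how many radix digits can be held before adding 1.0 is lost.
    precision = 0
    z = 1.0
    while (z + 1.0) - z == 1.0:
        precision += 1
        z *= radix
    return radix, precision


def sqrt_roundtrip_ok(limit=1000):
    """Check that sqrt(x * x) == x for the integers 1..limit."""
    return all(math.sqrt(float(x) * float(x)) == float(x)
               for x in range(1, limit + 1))


radix, precision = find_radix_and_precision()
print(f"radix = {radix}, precision = {precision} digits")
print("sqrt(x*x) == x for small integers:", sqrt_roundtrip_ok())
```

On a soft-float target such as the one studied here, the same probes would be compiled from C and would exercise the compiler's floating-point emulation routines rather than a hardware FPU.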
The first hardware configuration for the tests consists of:

- Mono-processor configuration.
- 16 Mbyte of SDRAM memory in 1 bank.
- Kbyte ROM memory.
- Instruction cache and data cache are both 1*4 Kbytes, 16 bytes/line.

The second hardware configuration for the tests consists of:

- Bi-processor configuration.
- 16 Mbyte of SDRAM memory in 1 bank.
- Kbyte ROM memory.
- For the two processors, instruction cache and data cache are both 1*4 Kbytes, 16 bytes/line.
- The two processors are connected in SMP configuration.

It should be noted that the benchmarks are loaded and executed from SDRAM.

A. Results obtained using Dhrystone benchmark

After executing the Dhrystone benchmark on our platform simulators, we obtained the values reported below for the chosen systems. Figure 3 shows the performance of each architecture in terms of Dhrystone MIPS. The performance gain is about 33% with the bi-processor configuration.

Figure 3. Performance in Dhrystone MIPS of the two architectures

TABLE 1. RESULTS OBTAINED WITH THE TWO PLATFORM SIMULATORS FOR DHRYSTONE BENCHMARK

                             Mono-processor Architecture         Bi-processor Architecture
Cycles                       (value missing)                     (value missing)
Instructions                 (value missing)                     (value missing)
Overall CPI                  (value missing)                     (value missing)
CPU performance (50.0 MHz)   MOPS (31.66 MIPS, 0.00 MFLOPS)      MOPS (41.43 MIPS, 0.00 MFLOPS)

B. Results obtained using Stanford benchmark

After executing Stanford on our platform simulators, we obtained the following performance report:

Figure 4. Execution time in milliseconds of the ten algorithms included in the Stanford benchmark

Figure 5. Composite performance of the two architectures for non-floating-point and floating-point applications

TABLE 2. RESULTS OBTAINED WITH THE TWO PLATFORM SIMULATORS FOR STANFORD BENCHMARK

                             Mono-processor Architecture         Bi-processor Architecture
Cycles                       (value missing)                     (value missing)
Instructions                 (value missing)                     (value missing)
Overall CPI                  (value missing)                     (value missing)
CPU performance (50.0 MHz)   MOPS (29.49 MIPS, 0.59 MFLOPS)      MOPS (41.79 MIPS, 0.83 MFLOPS)
Cache hit rate               96.5 % (99.8 / 75.1)                93.5 % (99.8 / 60.2)
Simulated time               ms (value missing)                  ms (value missing)

After examining the simulator report, we conclude that the bi-processor configuration gives a performance gain of 12 % for integer operations and 32 % for floating-point operations.
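The simulator throughput figures that survive in Tables 1 and 2 can be compared with a short calculation. The sketch below is illustrative only: the helper names `relative_gain` and `implied_cpi` are hypothetical, and the relation CPI = clock (MHz) / MIPS is an assumption that holds for a single core whose reported MIPS is an average over the run. Note that these simulator MIPS and MFLOPS values are raw instruction rates, not the benchmark scores plotted in Figures 3 to 5, so the ratios need not coincide with the percentages quoted in the text.

```python
def relative_gain(baseline, improved):
    """Percentage by which `improved` exceeds `baseline`."""
    return (improved / baseline - 1.0) * 100.0


def implied_cpi(clock_mhz, mips):
    """Cycles per instruction implied by a clock rate and a MIPS figure.

    Assumes one core and MIPS averaged over the whole run.
    """
    return clock_mhz / mips


# (mono-processor, bi-processor) values copied from Tables 1 and 2.
SIM_MIPS = {"Dhrystone": (31.66, 41.43), "Stanford": (29.49, 41.79)}
STANFORD_MFLOPS = (0.59, 0.83)
CLOCK_MHZ = 50.0

for name, (mono, bi) in SIM_MIPS.items():
    print(f"{name}: {relative_gain(mono, bi):.1f} % more simulated MIPS "
          f"with two processors")

print(f"Stanford: {relative_gain(*STANFORD_MFLOPS):.1f} % more simulated MFLOPS")

mono_cpi = implied_cpi(CLOCK_MHZ, SIM_MIPS["Dhrystone"][0])
print(f"Implied mono-processor CPI on Dhrystone: {mono_cpi:.2f}")
```

For the Dhrystone run this gives a throughput increase of roughly 31 %, close to the gain of about 33 % read from Figure 3. For the bi-processor simulator the implied CPI is not computed, since it depends on how the MIPS figure is aggregated over the two cores.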

This performance gain is not equally distributed among the ten algorithms included in the Stanford benchmark, and the choice between these two architectures will depend on the final application.

C. Results obtained using Paranoia benchmark

After executing the Paranoia benchmark on the two platform simulators, we conclude that FPU operations execute correctly on both architectures. However, the benchmark reports:

- Addition/Subtraction neither rounds nor chops.
- Sticky bit used incorrectly or not at all.
- FLAW: lack(s) of guard digits or failure(s) to correctly round or chop (noted above) count as one flaw in the final tally below.

This type of failure does not endanger the functionality of the system but can cause some loss of precision. The source of this failure is certainly the code generated under the soft-float settings of the GCC compiler.

VI. DISCUSSION

The results reported by the three benchmarks cover three fields of computing system performance: Dhrystone allowed us to compare integer unit performance, Stanford to compare the performance of different standard algorithms in integer and floating-point computing, and Paranoia to characterize floating-point operations. The results obtained with the three benchmarks are coherent for the mono-processor and bi-processor architectures. The same approach can be used to compare the performance of other architectures, but this kind of work must be done carefully: some studies report fragilities in SPEC CPU95 and CPU2000 [13], which are supersets of the benchmarks we used, while other studies argue that benchmarks must be closely adapted to the application the target will be used for [14].

VII. CONCLUSION AND PERSPECTIVES

The three tools used in this paper are important free benchmarks able to report pure hardware performance when prototyping an embedded microprocessor system. However, this kind of measurement must be extended to report true performance, especially for real-time embedded systems, since it cannot compare OS characteristics. Examples are RTOS performance measurements [15] [16] such as overhead, interrupt response and preemptive scheduling [17]. For this reason we intend to extend our study by measuring the performance of eCos and comparing it with other RTOSs such as RTEMS on the same platform.

REFERENCE

[1] Jorden Peddersen, Seng Lin Shee, Andhi Janapsatya, Sri Parameswaran, Rapid Embedded HW/SW System Generation. In Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design (VLSID 05), 2005 IEEE.
[2] David Sheldon, Rakesh Kumar, Frank Vahid, Dean Tullsen, Roman Lysecky, Conjoining Soft-Core FPGA Processors, ICCAD 06, November 5-9, 2006, San Jose, CA, pp
[3] Ludovic L'Hours, Generating Efficient Custom FPGA Soft-Cores for Control-Dominated Applications, Proceedings of the 16th International Conference on Application-Specific Systems, Architecture and Processors (IEEE ASAP 05).
[4] Pablo Huerta, Javier Castillo, Jose Ignacio Martinez, César Pedraza, Exploring FPGA Capabilities for Building Symmetric Multiprocessor Systems, Programmable Logic, SPL ' rd Southern Conference, IEEE 2007, pp
[5] Johann Großschädl, Stefan Tillich, Alexander Szekely, Performance Evaluation of Instruction Set Extensions for Long Integer Modular Arithmetic on a SPARC V8 Processor, IEEE DSD
[6] Lizy Kurian John, Lieven Eeckhout, Performance Evaluation and Benchmarking, CRC Press 2006, ISBN
[7] Wayne Wolf, High-Performance Embedded Computing, Elsevier 2007, ISBN 13:
[8]
[9] Anthony J. Massa, Embedded Software Development with eCos, ISBN Prentice Hall
[10]
[11]
[12] Karim Yaghmour, Building Embedded Linux Systems, O'Reilly 2003, ISBN: X.
[13] Hans Vandierendonck, Koen De Bosschere, Eccentric and Fragile Benchmarks, 2004 IEEE International Symposium on ISPASS, pp
[14] Ajay M. Joshi, Lieven Eeckhout, and Lizy K. John, Exploring the Application Behavior Space Using Parameterized Synthetic Benchmarks, Parallel Architecture and Compilation Techniques, PACT th International Conference, pp
[15] Frederick M. Proctor and William P. Shackleford, Real-time Operating System Timing Jitter and its Impact on Motor Control, Proceedings, 2001 SPIE Conference on Sensors and Controls for Intelligent Manufacturing II.
[16] Ahmed Amine Jerraya, Sungjoo Yoo, Diederik Verkest, Norbert Wehn, Embedded Software for SoC, Kluwer Academic Publishers 2003, ebook ISBN:
