Conservative design margins in modern multicore CPU chips aim to guarantee correct execution of the software layers of a computing system under various operating conditions, accounting for the inherent variability among different cores of the same CPU chip, among different manufactured chips, and among different workloads. However, guard-banding the main operational parameters of CPU chips (voltage, frequency) limits energy efficiency.
In this tutorial we will present recent studies on design margins characterization, identification and harnessing in modern multicore CPUs.
- We will present the main challenges of the massive process of characterizing and identifying the design margins, and how these challenges can be addressed. Such a process aims to identify the different types of variability in modern multicore CPUs (across cores, across chips and across workloads). It also aims to analyze system behavior in scaled conditions: what types of malfunctions are observed (program Silent Data Corruptions, corrected and uncorrected errors captured by the hardware, application and system crashes) and what the corresponding failure probabilities are. We will discuss how such a process can be automated and how the margins of different CPU chips can be efficiently recorded. An implementation of the characterization process on state-of-the-art ARMv8-based servers will be presented.
- We will present findings about the magnitude of power and energy that can be saved through the exploitation of the margins and the variability. The characterized design margins and the variability among cores and chips can drive (static or dynamic) workload balance decisions at the system level based on voltage and frequency scaling knobs of the underlying hardware.
- We will discuss the effectiveness of dedicated micro-viruses for identifying the safe Vmin and failure probabilities, compared to classic benchmark workloads, and how the time-consuming characterization process can be significantly accelerated.
- We will quantify the importance of clock frequency, thread/core allocation and workload behavior in scaled-voltage operation for energy efficiency, and show how this quantification can be exploited by the system layers for task allocation aiming at either energy reduction or a balanced energy-performance operation.
- The tutorial analysis is based on real system measurements on different multicore server CPU chips, mainly based on the ARMv8 architecture (such as AppliedMicro's X-Gene 2 and X-Gene 3, and Cavium's ThunderX). We will also discuss and compare these implementations with each other and with different architectures (mainly Intel and AMD x86 CPU chips).
- We will present a health monitoring service developed to deliver error reports and symptoms of potential upcoming failures. The monitor was designed to provide a flexible interface and extensibility in describing critical conditions of the system.
- We will discuss the modeling of the behavior of CPUs when operating in scaled conditions by employing microarchitectural simulators. The system-level characterization measurements and behavior recording can be the source for modeling at the simulation level.
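The characterization process outlined in the list above can be illustrated with a minimal sketch. It is not the framework presented in the tutorial: the function names, the outcome labels, the voltage steps and the `toy_core` stand-in for a real hardware run are all assumptions made for illustration. The sketch sweeps the voltage downwards, records the outcome class of every run, and derives a safe Vmin and a per-voltage failure probability from the recorded counts.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    OK = "ok"                 # correct output, no hardware error reported
    SDC = "sdc"               # silent data corruption: wrong output, nothing reported
    CE = "corrected_error"    # error detected and corrected by the hardware
    UE = "uncorrected_error"  # error detected but not corrected
    CRASH = "crash"           # application or system crash


def characterize(run_once, voltages_mv, runs_per_step):
    """Sweep voltages from high to low and count the outcomes at each step."""
    results = {}
    for v in sorted(voltages_mv, reverse=True):
        results[v] = Counter(run_once(v) for _ in range(runs_per_step))
    return results


def safe_vmin(results):
    """Lowest voltage such that it and every higher tested voltage saw only OK runs."""
    safe = None
    for v in sorted(results, reverse=True):
        if set(results[v]) != {Outcome.OK}:
            break
        safe = v
    return safe


def failure_probability(results, v):
    """Fraction of runs at voltage v that did not complete correctly."""
    counts = results[v]
    total = sum(counts.values())
    return (total - counts[Outcome.OK]) / total


def toy_core(v_mv):
    """Hypothetical, deterministic stand-in for running a workload on real hardware."""
    if v_mv >= 900:
        return Outcome.OK
    if v_mv >= 880:
        return Outcome.CE
    if v_mv >= 860:
        return Outcome.SDC
    return Outcome.CRASH


results = characterize(toy_core, range(980, 840, -20), runs_per_step=10)
print(safe_vmin(results))
print(failure_probability(results, 880))
```

On real hardware the outcome of a run at a given voltage is not deterministic, so each step would be repeated many times and the failure probability would take values between 0 and 1. The sketch also counts corrected errors as failures when computing the safe Vmin, which is a conservative choice made here, not one taken from the tutorial.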
The purpose of the tutorial is to summarize recent characterization and exploitation findings on multicore CPUs in server machines, to emphasize the potential for energy savings through identification and exploitation of design margins, and to discuss our reports and findings in relation to other machines similarly studied in the past.
Dimitris Gizopoulos (firstname.lastname@example.org) is Professor at the Department of Informatics & Telecommunications of the National & Kapodistrian University of Athens in Greece where he leads the Computer Architecture Laboratory. The group's research focuses on the area of Dependable and Energy-Efficient Computer Architecture, and in particular reliability assessment, fault/error tolerance, design correctness validation, design margins harnessing and their relation to performance and energy-efficiency for microprocessors. Gizopoulos has published more than 170 papers in top-tier conferences and journals, has served as Associate Editor for several IEEE Transactions and Magazines (TC, TVLSI, D&T, TSUSC) and as member of several Program, Organizing and Steering Committees of major IEEE and ACM technical conferences.
Gizopoulos is an IEEE Fellow, a Golden Core member of the IEEE Computer Society and a Senior ACM member.
George Papadimitriou (email@example.com) received the B.Sc. degree in Electronic Computer Systems Engineering from the Technological Educational Institute of Piraeus, Greece, and the M.Sc. degree from the Dept. of Informatics & Telecommunications at the University of Athens, Greece. He is currently a Ph.D. student in the Dept. of Informatics & Telecommunications at the University of Athens. His research interests focus on energy-efficient architectures, reliability of modern computer architectures, functional correctness of hardware designs and design validation of microprocessors and microprocessor-based systems.
Athanasios Chatzidimitriou (firstname.lastname@example.org) is a PhD student at the University of Athens working on methods and tools for microarchitecture level reliability assessment as well as energy-efficient computing. He holds a BSc in Computer Engineering and an MSc in Computer Science. He is the lead developer of GeFIN.
ISPASS 2018 - "Micro-Viruses for Fast System-Level Voltage Margins Characterization in Multicore CPUs", G. Papadimitriou, A. Chatzidimitriou, M. Kaliorakis, Y. Vastakis, D. Gizopoulos, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2018), Belfast, Northern Ireland, United Kingdom, April 2018.
CAL 2018 - "Statistical Analysis of Multicore CPUs Operation in Scaled Voltage Conditions", M. Kaliorakis, A. Chatzidimitriou, G. Papadimitriou, and D. Gizopoulos, IEEE Computer Architecture Letters (CAL 2018), Volume: 17, Issue: 1, February 2018.
DATE 2018 - "An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits", G. Karakonstantis, K. Tovletoglou, L. Mukhanov, H. Vandierendonck, D. S. Nikolopoulos, P. Lawthers, P. Koutsovasilis, M. Maroudas, C. D. Antonopoulos, C. Kalogirou, N. Bellas, S. Lalis, S. Venugopal, A. Prat-Perez, A. Lampropulos, M. Kleanthous, A. Diavastos, Z. Hadjilambrou, P. Nikolaou, Y. Sazeides, P. Trancoso, G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, D. Gizopoulos, and S. Das, ACM/IEEE Design, Automation, and Test in Europe (DATE 2018), Dresden, Germany, March 19-23, 2018.
MICRO 2017 - "Harnessing Voltage Margins for Energy Efficiency in Multicore CPUs", G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, D. Gizopoulos, P. Lawthers, and S. Das, IEEE/ACM International Symposium on Microarchitecture (MICRO 2017), Cambridge, MA, USA, October 2017.
IOLTS 2017 - "Voltage Margins Identification on Commercial x86-64 Multicore Microprocessors", G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, C. Magdalinos, D. Gizopoulos, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS 2017), Thessaloniki, Greece, July 2017.
SELSE 2017 - "A System-Level Voltage/Frequency Scaling Characterization Framework for Multicore CPUs", G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, D. Gizopoulos, G. Favor, K. Sankaran and S. Das, IEEE Silicon Errors in Logic & System Effects (SELSE 2017), Boston, MA, USA, March 2017.