SYCL goes Green with SYnergy

December 11, 2023 by Biagio Cosenza, University of Salerno, Italy, in collaboration with CINECA Supercomputing center, and member of SYCL Working Group in Khronos.

SYnergy research project enables energy efficient C++ based heterogeneous parallel programming with the Khronos SYCL API

In high performance computing, energy efficiency has become critical, from the design of cooling systems to the development of low-power hardware architectures. Processor hardware can help reduce power consumption through dynamic voltage and frequency scaling.

Many GPU vendors provide APIs to take advantage of hardware power/energy scaling capabilities. Programmers are usually interested in two functionalities: querying either the current power draw or the accumulated energy consumption of a defined power domain, and dynamically scaling the core or memory frequency. However, these vendor APIs are typically very different from each other, with no common interface to provide portable functionality. GPU power interfaces are also quite limited compared to their CPU counterparts due to GPUs’ limited support for power domains. As a result, no portable interface for power and energy scaling exists, and these proprietary interfaces are not easily integrated into modern programming models for heterogeneous computing such as SYCL™ and OpenCL™ from the Khronos® Group.

Energy Efficiency Requires Abstractions

SYnergy [1] is a SYCL-based API for energy-efficient computing. The API, inspired by Celerity [2], provides a common interface that enables energy profiling and frequency scaling on devices from different vendors without requiring developers to work with vendor-specific libraries. Under the hood, SYnergy maps high-level energy semantics to vendor APIs such as Level Zero for Intel GPUs, NVML for NVIDIA GPUs, and ROCm SMI for AMD GPUs.
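One way to picture such a mapping is a small abstract interface with one implementation per vendor. The following is a minimal C++ sketch under stated assumptions, not SYnergy's actual code: the names power_backend, mock_backend, and lower_to_min_freq are hypothetical, and the mock stands in for a backend that would forward calls to a vendor library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical vendor-neutral interface (illustrative names only).
class power_backend {
public:
    virtual ~power_backend() = default;
    virtual std::string name() const = 0;
    // Accumulated energy of the device power domain, in joules.
    virtual double energy_joules() const = 0;
    // Core frequencies the device accepts, in MHz.
    virtual std::vector<unsigned> supported_core_freqs_mhz() const = 0;
    virtual unsigned core_freq_mhz() const = 0;
    // Returns false if the requested frequency is not supported.
    virtual bool set_core_freq_mhz(unsigned mhz) = 0;
};

// Stand-in backend for illustration. A real backend would wrap a vendor
// library here instead of keeping state in memory.
class mock_backend final : public power_backend {
public:
    mock_backend(std::string name, std::vector<unsigned> freqs)
        : name_(std::move(name)), freqs_(std::move(freqs)) {
        assert(!freqs_.empty());
        // Assumption: the device starts at its highest frequency.
        current_ = *std::max_element(freqs_.begin(), freqs_.end());
    }
    std::string name() const override { return name_; }
    double energy_joules() const override { return energy_; }
    std::vector<unsigned> supported_core_freqs_mhz() const override { return freqs_; }
    unsigned core_freq_mhz() const override { return current_; }
    bool set_core_freq_mhz(unsigned mhz) override {
        if (std::find(freqs_.begin(), freqs_.end(), mhz) == freqs_.end()) return false;
        current_ = mhz;
        return true;
    }
    // Test helper: pretend the device consumed some energy.
    void add_energy(double joules) { energy_ += joules; }

private:
    std::string name_;
    std::vector<unsigned> freqs_;
    unsigned current_ = 0;
    double energy_ = 0.0;
};

// Portable helper: written once against the interface, with no vendor code.
unsigned lower_to_min_freq(power_backend& dev) {
    const std::vector<unsigned> freqs = dev.supported_core_freqs_mhz();
    assert(!freqs.empty());
    const unsigned lowest = *std::min_element(freqs.begin(), freqs.end());
    dev.set_core_freq_mhz(lowest);
    return lowest;
}
```

Code such as lower_to_min_freq only sees the abstract interface, so supporting another vendor in this sketch means adding one more subclass rather than touching application code.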

Figure 1 shows the SYnergy approach. The main entry point is the synergy::queue class, which extends the standard SYCL queue with energy capabilities. SYnergy provides both coarse-grained and fine-grained energy capabilities. In particular, it allows developers to specify an energy target for each kernel, which sets the appropriate kernel-dependent frequency configuration. At compile time, an LLVM pass extracts code features for each kernel and predicts the target frequency configuration using a pre-trained machine learning model. At runtime, SYnergy simply selects the pre-calculated optimal frequency setting for that specific kernel and target.
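The split between compile-time prediction and runtime selection can be sketched as a lookup table keyed by kernel and target. This is only an illustration under stated assumptions: frequency_table, freq_config, target, and run_with_target are hypothetical names and do not reproduce the synergy::queue API, and the table entries would come from the compile-time model rather than being recorded by hand.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical energy targets; the names in the real API may differ.
enum class target { min_energy, max_perf, min_edp, min_ed2p };

struct freq_config {
    unsigned core_mhz;
    unsigned mem_mhz;
};

// Holds one predicted frequency configuration per (kernel, target) pair.
class frequency_table {
public:
    explicit frequency_table(freq_config fallback) : fallback_(fallback) {}

    // Called once per kernel and target, e.g. with values predicted offline.
    void record(const std::string& kernel, target t, freq_config f) {
        assert(f.core_mhz > 0 && f.mem_mhz > 0);
        table_[{kernel, t}] = f;
    }

    // Runtime path: a plain lookup, falling back to the default configuration
    // when no prediction exists for this kernel and target.
    freq_config lookup(const std::string& kernel, target t) const {
        auto it = table_.find({kernel, t});
        return it != table_.end() ? it->second : fallback_;
    }

private:
    std::map<std::pair<std::string, target>, freq_config> table_;
    freq_config fallback_;
};

// Applies the configuration for one kernel, then runs it. set_freq and kernel
// are caller-supplied callables, so no vendor API is assumed here.
template <typename SetFreq, typename Kernel>
freq_config run_with_target(const frequency_table& tbl, const std::string& kernel_name,
                            target t, SetFreq&& set_freq, Kernel&& kernel) {
    const freq_config f = tbl.lookup(kernel_name, t);
    set_freq(f);
    kernel();
    return f;
}
```

Because the expensive work (feature extraction and model inference) is assumed to happen before execution, the per-kernel runtime cost in this sketch is a single map lookup plus the frequency change itself.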

Energy optimizations such as frequency scaling have no single optimal configuration: because the problem is multi-objective, the solution is a set of Pareto-optimal configurations. To provide a high-level, easy-to-use tuning interface, SYnergy offers a set of energy targets, as shown in Figure 2. It supports classical energy metrics such as MIN_ENERGY, which returns the frequency that minimizes per-task energy consumption, and MAX_PERF, which returns the frequency that maximizes performance. Scalar energy-delay metrics such as EDP (energy-delay product) and ED2P (energy-delay-squared product) are also supported. To make the available tradeoffs easier to understand, SYnergy also defines a set of new metrics that are designed to be easy to use and interpret. For example, energy-saving targets such as ES_75 return a configuration that delivers 75% of the potential energy savings, using the default frequency as a baseline. Another way to look at energy tradeoffs is to focus on performance loss rather than energy savings. For instance, the performance-loss target PL_25 returns a frequency configuration that incurs 25% of the potential performance loss, again using the default frequency as a baseline.
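To make these targets concrete, the sketch below selects a configuration from a list of per-frequency measurements. It is a minimal illustration, not SYnergy's implementation. EDP and ED2P use their standard definitions (energy times time, and energy times time squared). The ES and PL rules are one assumed reading of the description above: "potential" is taken as the gap between the default configuration and the minimum-energy configuration. All type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// One measured (or predicted) operating point of a kernel.
struct sample {
    unsigned core_mhz;
    double time_s;
    double energy_j;
};

inline double edp(const sample& s)  { return s.energy_j * s.time_s; }
inline double ed2p(const sample& s) { return s.energy_j * s.time_s * s.time_s; }

// Returns the sample with the smallest cost.
template <typename Cost>
sample argmin(const std::vector<sample>& v, Cost cost) {
    assert(!v.empty());
    return *std::min_element(v.begin(), v.end(),
        [&](const sample& a, const sample& b) { return cost(a) < cost(b); });
}

inline sample min_energy(const std::vector<sample>& v) {
    return argmin(v, [](const sample& s) { return s.energy_j; });
}
inline sample max_perf(const std::vector<sample>& v) {
    return argmin(v, [](const sample& s) { return s.time_s; });
}
inline sample min_edp(const std::vector<sample>& v)  { return argmin(v, edp); }
inline sample min_ed2p(const std::vector<sample>& v) { return argmin(v, ed2p); }

// Keeps the samples that no other sample beats in both time and energy.
std::vector<sample> pareto_front(const std::vector<sample>& v) {
    std::vector<sample> front;
    for (const sample& a : v) {
        const bool dominated = std::any_of(v.begin(), v.end(), [&](const sample& b) {
            return b.time_s <= a.time_s && b.energy_j <= a.energy_j &&
                   (b.time_s < a.time_s || b.energy_j < a.energy_j);
        });
        if (!dominated) front.push_back(a);
    }
    return front;
}

constexpr double eps = 1e-9;  // tolerance for floating-point comparisons

// ES_x (assumed reading): the fastest sample whose energy achieves at least
// `fraction` of the savings between the default (v[def]) and minimum energy.
sample energy_saving(const std::vector<sample>& v, std::size_t def, double fraction) {
    assert(def < v.size() && fraction >= 0.0 && fraction <= 1.0);
    const double e_def = v[def].energy_j;
    const double budget = e_def - fraction * (e_def - min_energy(v).energy_j);
    std::vector<sample> ok;
    std::copy_if(v.begin(), v.end(), std::back_inserter(ok),
                 [&](const sample& s) { return s.energy_j <= budget + eps; });
    return max_perf(ok);
}

// PL_x (assumed reading): the lowest-energy sample whose slowdown stays within
// `fraction` of the slowdown between the default and the minimum-energy sample.
sample perf_loss(const std::vector<sample>& v, std::size_t def, double fraction) {
    assert(def < v.size() && fraction >= 0.0 && fraction <= 1.0);
    const double t_def = v[def].time_s;
    const double limit = t_def + fraction * (min_energy(v).time_s - t_def);
    std::vector<sample> ok;
    std::copy_if(v.begin(), v.end(), std::back_inserter(ok),
                 [&](const sample& s) { return s.time_s <= limit + eps; });
    return min_energy(ok);
}
```

In this sketch, energy_saving(v, def, 0.75) plays the role of ES_75 and perf_loss(v, def, 0.25) the role of PL_25. Both always return a point between the default and the minimum-energy configuration, which is what makes such targets easier to reason about than a raw frequency value.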

Running on Large-Scale Compute Clusters

The SYnergy approach has been experimentally evaluated on NVIDIA V100, AMD MI100, and Intel Max Series 1100 GPUs. A large-scale experimental evaluation has also been performed on the Marconi100 cluster at CINECA, thanks to a SLURM scheduler plugin that enables energy measurement and frequency scaling on large clusters. The results show scalable energy savings with up to 64 GPUs. The SYnergy approach and the newly defined metrics allowed us to find solutions with up to 30% and 20% energy savings compared to the default configuration on the MiniWeather and CloverLeaf applications, respectively.

About SYnergy

SYnergy is a research project led by the University of Salerno, Italy, in collaboration with CINECA Supercomputing center. The SYnergy paper was presented by Kaijie Fan at SC 2023. The artifact source code, which has passed the artifact evaluation, is available on GitHub [3]. The SYnergy project has been funded by the European High-Performance Computing Joint Undertaking (JU) under Grant Agreement No. 956137 (LIGATE project) and No. 956560 (REGALE project).

About SYCL

First introduced in 2014, SYCL is a C++ based heterogeneous parallel programming framework for accelerating High Performance Computing (HPC), machine learning, embedded computing, and compute-intensive desktop applications on a wide range of processor architectures, including CPUs, GPUs, FPGAs, and tensor accelerators. SYCL 2020 builds on the functionality of SYCL 1.2.1 to provide improved programmability, smaller code size and increased performance. Based on C++17, SYCL 2020 enables easier acceleration of standard C++ applications and drives a closer alignment with the ISO C++ roadmap. SYCL 2020 accelerates adoption and deployment of SYCL across multiple platforms, including the use of diverse acceleration API backends in addition to OpenCL.

[1] Kaijie Fan, Marco D'Antonio, Lorenzo Carpentieri, Biagio Cosenza, Federico Ficarelli, Daniele Cesarini: SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving. SC 2023: 69:1-69:13
[2] Peter Thoman, Philip Salzmann, Biagio Cosenza, Thomas Fahringer: Celerity: High-Level C++ for Accelerator Clusters. Euro-Par 2019: 291-303
[3] SYnergy on GitHub - SYnergy: Energy Measurement and Frequency Scaling for SYCL applications