High-performance computing (HPC) systems offer extensive computational resources designed to support demanding, distributed applications. These resources include traditional CPUs and GPU accelerators, which can significantly enhance application performance. Using these subsystems efficiently is crucial for reducing both time to solution and energy consumption. Profiling tools give insight into how an application uses a distributed system with CPUs and GPUs, and should be integrated into the application's development and optimization workflow so that computational resources are used to their full potential.
On CSCS platforms, you have access to a suite of powerful performance analysis tools specifically designed for parallel, distributed, and GPU-accelerated applications. These tools enable you to analyze how your application utilizes compute resources, optimize its performance, and identify any potential bottlenecks.
Learning to profile applications effectively is crucial for building a deeper understanding of how your code interacts with the underlying hardware. This section introduces the performance analysis solutions available at CSCS. If you have issues or questions about performance analysis tools, please do not hesitate to contact us via support.cscs.ch.
Linaro Forge profiling tool (MAP)
The Linaro Forge MAP profiling tool is a high-performance sampling profiler designed to optimize the efficiency of software running across multiple multicore processors and GPUs. It provides "hot-spot" analysis, offering stack traces and performance metrics related to CPU and memory usage, MPI communication, I/O, GPU utilization, and energy consumption. The tool supports profiling for applications written in C, C++, Python, and Fortran.
One key advantage of Linaro Forge MAP is that, as a sampling profiler, it requires no special compilation flags other than the inclusion of the debug symbol flag -g. To profile an application, simply run it using the Linaro MAP profiler executable.
For more information regarding how to use Linaro MAP with UENV, refer to the Linaro Forge UENV user guide.
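As a rough illustration of what "running the application using the MAP executable" looks like, the sketch below assembles a command line of the form map --profile srun ... and prints it rather than executing it. The helper name map_command, the application path ./my_app, its argument input.dat, and the task count are hypothetical; the sketch also assumes the application was built with -g and is launched with Slurm's srun.

```python
import shlex


def map_command(app, app_args=(), ntasks=4):
    """Return the argv list that runs `app` under Linaro Forge MAP.

    Hypothetical helper: the application is assumed to be built with -g
    and launched with Slurm's srun.
    """
    # `map --profile` runs the profiler without the GUI and writes a
    # .map file that can be opened in the Forge client afterwards.
    return ["map", "--profile", "srun", f"--ntasks={ntasks}", app, *app_args]


# Print the command instead of executing it, so the sketch runs anywhere.
print(shlex.join(map_command("./my_app", ["input.dat"])))
# prints: map --profile srun --ntasks=4 ./my_app input.dat
```

On a system where the Forge UENV is loaded, the printed command would typically be placed in a batch script or run inside an allocation; the exact launcher options depend on the job.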
Forge can only be used at CSCS with NVIDIA GPUs. CSCS does not have a license to profile and debug codes on AMD GPUs.
NVIDIA Nsight Systems
NVIDIA Nsight Systems is a system-wide performance analysis tool that enables developers to gain a deep understanding of how their applications utilize computing resources, such as CPUs, GPUs, memory, and I/O. The tool provides a unified view of an application's performance across the entire system, capturing detailed trace information that allows users to analyze how different components interact and where performance issues might arise. A key advantage of Nsight Systems is its ability to provide detailed traces of GPU activity, offering deeper insights into GPU utilization. It features a timeline-based visualization, enabling developers to inspect the execution flow, pinpoint latencies, and correlate events across different system components. As a sampling profiler, it can be easily used to profile applications written in C, C++, Python, Fortran, or Julia by wrapping the application with the Nsight Systems profiler executable.
NVIDIA Nsight Systems is available with any UENV that comes with a CUDA compiler.
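Wrapping an application with the Nsight Systems executable usually means prefixing it with nsys profile. The sketch below only builds and prints such a command line; the helper name nsys_command, the report name, the application path, and the chosen trace list are illustrative assumptions, not CSCS recommendations.

```python
import shlex


def nsys_command(app, app_args=(), output="report", traces=("cuda", "nvtx", "osrt")):
    """Return the argv list that runs `app` under `nsys profile`.

    Hypothetical helper; it does not launch anything itself.
    """
    # -o names the report file; --trace selects which APIs are traced.
    trace_flag = f"--trace={','.join(traces)}"
    return ["nsys", "profile", "-o", output, trace_flag, app, *app_args]


# Print the command instead of executing it, so the sketch runs anywhere.
print(shlex.join(nsys_command("./my_app", ["input.dat"])))
```

The resulting report file can then be opened in the Nsight Systems GUI to inspect the timeline.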
NVIDIA Nsight Compute
NVIDIA Nsight Compute is a performance analysis tool specifically designed for optimizing GPU-accelerated applications. It focuses on providing detailed metrics and insights into the performance of CUDA kernels, helping developers identify performance bottlenecks and improve the efficiency of their GPU code. Nsight Compute offers a kernel-level profiler with customizable reports, enabling in-depth analysis of memory usage, compute utilization, and instruction throughput. It requires no source changes: applications written in C, C++, Python, Fortran, or Julia can be profiled by wrapping the application with the Nsight Compute profiler executable.
NVIDIA Nsight Compute is available with any UENV that comes with a CUDA compiler.
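For Nsight Compute, the wrapper executable is ncu. Because kernel-level profiling can be costly, it is common to restrict collection to selected kernels or a limited number of launches. The sketch below builds and prints such a command line; the helper name ncu_command, the report name, the kernel name, and the application path are hypothetical.

```python
import shlex


def ncu_command(app, app_args=(), output="kernels", kernel_name=None, launch_count=None):
    """Return the argv list that runs `app` under Nsight Compute (ncu).

    Hypothetical helper; it does not launch anything itself.
    """
    # -o names the report file; `--set full` requests the full section set.
    cmd = ["ncu", "-o", output, "--set", "full"]
    if kernel_name is not None:
        # Only profile kernels matching this name.
        cmd += ["--kernel-name", kernel_name]
    if launch_count is not None:
        # Stop collecting after this many profiled kernel launches.
        cmd += ["--launch-count", str(launch_count)]
    return cmd + [app, *app_args]


# Print the command instead of executing it, so the sketch runs anywhere.
print(shlex.join(ncu_command("./my_app", kernel_name="my_kernel", launch_count=5)))
```

Limiting the profiled kernels in this way keeps the run time and report size manageable when an application launches many kernels.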