Hi! In this class we are going to take a bird's-eye view of the available SDAccel optimizations. All the optimizations presented here can also be found in the SDAccel Environment Optimization Guide provided by Xilinx and available online. Obviously, these are not the only optimizations available; they are more a list of recommendations for improving the performance of an OpenCL™ application, to be used as a starting point for ideas to consider or investigate further. Within this context I’d like to organize these “recommendations” into three sets of optimizations: arithmetic optimizations, data-related optimizations and, last but not least, memory-related optimizations. Arithmetic optimizations are related, as the name suggests, to the possibility of improving performance by using optimized math functions, as in the case of the built-in math functions from the HLS Math Library, or by exploring optimized implementations based on a specific data representation, as in the case of fixed-point arithmetic. The line between data-related and memory-related optimizations is much more “blurry”, in the sense that some data-related optimizations have memory-related aspects and impacts, so it would be totally fair to consider them memory-related optimizations as well. That said, just for the sake of simplicity, I still like to keep this high-level distinction, considering data-related optimizations as optimizations specific to data and computation, and memory-related optimizations as recommendations on how to work with memories, interfaces and data transfers. Data-related optimizations can be divided into two recommendations: choosing the optimal work-group size, and being able to isolate data transfer and computation. Finally, memory-related optimizations aim at making the best use of the available memories and memory bandwidth, and at transferring data efficiently by using specific commands, like clEnqueueMigrateMemObjects. Let us now explore these optimizations/recommendations in a bit more detail, starting from the arithmetic ones. The first recommendation is to use the optimized built-in math functions from the HLS Math Library. The OpenCL specification provides several built-in math functions, and we can take advantage of them! All the built-in math functions with the native_ prefix are mapped to one or more native device instructions and will typically have better performance than the corresponding functions without the native_ prefix. In the Xilinx SDAccel environment these native_ built-in functions exploit the equivalent functions in the Vivado HLS Math library. This is where the benefit of using built-in math functions comes into play: the functions in the Vivado HLS Math library have already been optimized for Xilinx FPGAs in terms of area and performance. It is important to notice that the accuracy, and in some cases the input ranges, of these functions are implementation-defined. Therefore, it is important to verify that the accuracy meets the application requirements, but once this is done, it is definitely recommended to use the native_ built-in functions.
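To make this concrete, here is a minimal sketch of what using native_ built-ins could look like inside an OpenCL C kernel; the kernel name and arguments are purely illustrative, not taken from any specific design.

```c
// Minimal illustrative OpenCL C kernel: native_sqrt and native_divide
// trade accuracy for speed, and their precision is implementation-
// defined, so validate them against your accuracy requirements first.
__kernel void normalize_dist(__global const float *x,
                             __global const float *y,
                             __global float *out,
                             const float scale)
{
    int i = get_global_id(0);
    float d = native_sqrt(x[i] * x[i] + y[i] * y[i]); // instead of sqrt()
    out[i]  = native_divide(d, scale);                // instead of d / scale
}
```

In SDAccel, the two native_ calls would map to the corresponding optimized implementations in the Vivado HLS Math library.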
A second optimization can be obtained by exploring fixed-point arithmetic. This is the eternal fight between fixed-point and floating-point computation. In both cases we are referring to the way numbers are represented. On one hand, with a fixed-point representation we fix the number of digits with respect to the decimal point. On the other hand, with a floating-point representation, the placement of the decimal point can vary, or "float", relative to the significant digits of the number. Within this context, just to name one difference, floating point can support a wider range of values than a fixed-point representation, with the ability to span from very small numbers to very large ones. Because of this, some applications use floating-point computation, but all these benefits come at a cost… managing a floating-point representation is more complex compared to fixed-point arithmetic. Within this context, it may happen that using fixed-point arithmetic significantly improves power efficiency and saves area while keeping the same level of accuracy, as can be the case for some deep learning algorithms. Since deep learning inference can exploit lower bit precision without sacrificing accuracy, it can be convenient to explore alternative implementations, such as the INT8 deep learning operations implemented on the Xilinx DSP48E2 slice, rather than using floating-point arithmetic right away. Therefore the general recommendation, before committing to floating-point operations for your application, is to explore fixed-point arithmetic first. Before continuing, if you are interested in knowing more about deep learning and how to efficiently use fixed-point representations, I’d suggest reading the “Deep Learning with INT8 Optimization on Xilinx Devices” white paper.
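Just to give an idea of what a fixed-point representation looks like in practice, here is a tiny, self-contained sketch of Q16.16 arithmetic in plain C; the type and helper names are invented for illustration (in C/C++ kernels you would more likely rely on the Vivado HLS ap_fixed types instead of rolling your own).

```c
/* Illustrative Q16.16 fixed point: 16 integer bits, 16 fractional bits,
   stored in a plain 32-bit integer. */
typedef int fixed_t;
#define FRAC_BITS 16

static inline fixed_t fx_from_float(float f) { return (fixed_t)(f * (1 << FRAC_BITS)); }
static inline float   fx_to_float(fixed_t x) { return (float)x / (1 << FRAC_BITS); }

/* Multiplication: widen to 64 bits, then shift the binary point back. */
static inline fixed_t fx_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((long long)a * (long long)b) >> FRAC_BITS);
}
```

For example, fx_mul(fx_from_float(1.5f), fx_from_float(2.0f)) returns the Q16.16 encoding of 3.0 using only integer adders and multipliers, which is exactly why this kind of representation maps so cheaply onto FPGA resources.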
Let us now see the impact of being able to choose the optimal work-group size. As we know, the OpenCL computational model is built around the logical abstractions of work-items and work-groups. Just as a reminder, a work-item is the basic unit of work within an OpenCL device, while a work-group is a group of work-items. Within an OpenCL code, the computational model requires the user to specify both the global size, i.e., the total number of work-items in each dimension, and the local size, i.e., the work-group size in each dimension; by doing this we give the compiler more flexibility to optimize the size of the kernel (a minimal host-side sketch of this is shown at the end of this class). Global and local sizes can be 1D, 2D, or 3D, according to the dimensionality of the problem to process, which means that OpenCL can process, at most, 3D problems. The next optimization I’d like to introduce is the isolation of data transfer and computation. This is not really an optimization; it is more a kind of recommendation. Being able to separate the kernel computation from the corresponding communication infrastructure that transfers the data is quite important in order to optimize both. Separating the two will help you better understand where to begin your optimizations. To understand the performance of your kernels, it is crucial to be able to isolate either the data transfers or the computation, both to focus on a specific potential kernel optimization and to be able to use simpler control structures in the read/write functions, which makes burst data transfer detection simpler (the kernel sketch at the end of this class shows this read/compute/write structure). Finally, last but not least, we can explore the memory-related optimizations/suggestions. We can summarize them in the following list:
- Using clEnqueueMigrateMemObjects to transfer data
- Avoiding complex structures or classes for kernel arguments
- Using on-chip memories and
- Maximizing the utilization of the global memory bandwidth.
The clEnqueueMigrateMemObjects command, as we can read in the "SDAccel Environment Optimization Guide", allows the migration of memory objects to be explicitly performed ahead of the dependent commands. This allows the application to preemptively change the association of a memory object, through regular command queue scheduling, in order to prepare for another upcoming command, and it also allows applications to overlap the placement of memory objects with other unrelated operations before these memory objects are needed, potentially hiding transfer latencies (a host-side sketch of this call is also shown at the end of this class). Avoiding complex structures or classes for kernel arguments is quite crucial because, as we know, kernel arguments are mapped onto hardware interfaces between the host code and the FPGA. This means that complex structures or classes can lead to very complex hardware interfaces, due to memory layout and data packing differences. Using on-chip memories can improve the efficiency and performance of an accelerated application. This is because the acceleration platforms supported in the SDAccel environment can have as much as 10 MB of on-chip memory that can be used as on-chip global memories, pipes, and local and private memories. Finally, efficient data movement between the kernel running on the underlying FPGA device and the external global memory is critical to the performance of accelerated applications. Having the best implementation ever from a computational perspective running on the FPGA, without being able to efficiently read and write data from the external DDR SDRAM, for example, is going to end up in a useless design! A well-designed kernel minimizes memory access latency while maximizing the usage of the available data bandwidth provided by the acceleration platform. That is why using burst data transfers, using the full user data width of the memory controller, and using multiple DDR banks are considered key optimizations, and they deserve a specific class to be presented.
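Going back to work-group sizes, here is a minimal host-side sketch of how the global and local sizes are passed to clEnqueueNDRangeKernel; the queue and kernel handles, and the 16x16 tiling of a 1024x1024 problem, are hypothetical placeholders.

```c
#include <CL/cl.h>

/* Hypothetical handles, assumed to have been created earlier. */
extern cl_command_queue queue;
extern cl_kernel        kernel;

void launch_2d(void)
{
    /* 1024x1024 work-items in total, tiled into 16x16 work-groups;
       the local size must evenly divide the global size per dimension. */
    size_t global[2] = {1024, 1024};
    size_t local[2]  = {16, 16};

    clEnqueueNDRangeKernel(queue, kernel,
                           2,              /* work_dim: a 2D problem  */
                           NULL,           /* no global work offset   */
                           global, local,
                           0, NULL, NULL); /* no event dependencies   */
}
```

Passing an explicit local size, instead of leaving that argument NULL and letting the runtime decide, is what gives the compiler the sizing information mentioned above.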
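As for the first item in the memory-related list above, this is roughly how the host could use clEnqueueMigrateMemObjects to push a buffer toward the device ahead of the kernel that depends on it; again, the handles are hypothetical.

```c
#include <CL/cl.h>

extern cl_command_queue queue;   /* hypothetical, created earlier   */
extern cl_mem           buf_in;  /* buffer a later kernel will read */

void prefetch_input(void)
{
    cl_mem mems[1] = {buf_in};

    /* flags = 0 migrates the objects toward the device associated with
       the queue; issuing this early lets the transfer overlap with
       unrelated host work, potentially hiding its latency. */
    clEnqueueMigrateMemObjects(queue, 1, mems,
                               0,              /* toward the device */
                               0, NULL, NULL);
}
```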
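And even though burst transfers deserve their own class, here is a first taste that also shows the read/compute/write isolation and the use of on-chip memories mentioned earlier. Everything in it, from the kernel name to the vector addition and the tile size of 256, is purely illustrative; it is a sketch of the pattern, not a reference implementation.

```c
#define TILE 256

/* Illustrative single-work-item kernel: read, compute and write live in
   separate, simple loops, so each part can be analyzed and optimized on
   its own, and the sequential global-memory accesses make burst
   transfers easy for the tools to detect. n is assumed to be a
   multiple of TILE to keep the sketch short. */
__kernel void vadd(__global const int *a,
                   __global const int *b,
                   __global int *c,
                   const int n)
{
    __local int buf_a[TILE], buf_b[TILE], buf_c[TILE];

    for (int i = 0; i < n; i += TILE) {
        /* read: sequential, burst-friendly accesses into on-chip buffers */
        for (int j = 0; j < TILE; j++) buf_a[j] = a[i + j];
        for (int j = 0; j < TILE; j++) buf_b[j] = b[i + j];

        /* compute: touches only fast on-chip memory */
        for (int j = 0; j < TILE; j++) buf_c[j] = buf_a[j] + buf_b[j];

        /* write: sequential again */
        for (int j = 0; j < TILE; j++) c[i + j] = buf_c[j];
    }
}
```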