Hello, everyone, thanks for joining us. This is your introduction to the OpenCL standard as used with Intel's FPGAs. So stay tuned, this should be fun. This course will be partitioned into five modules. The first will provide a brief motivation for the necessity of parallel computing, along with its inherent challenges. In the second module, we'll begin to craft an understanding of OpenCL as a viable framework for the development of heterogeneous parallel computing solutions on Intel's FPGAs. In the third module, we'll explore the basics of the OpenCL standard. In the fourth module, you'll learn to write your very own OpenCL program. And finally, in module five, you'll get to compile and run your OpenCL programs using the Intel FPGA SDK for OpenCL.

To begin, we'll go over an introduction to heterogeneous parallel computing. As Almasi and Gottlieb put it in 1989, parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones which are then solved concurrently. Extracting better performance from compute systems has become a challenging problem for several reasons. Reason one is the power wall: as individual transistor sizes decrease, the number of transistors placed and routed on a chip increases, and thus the total power dissipated increases. Typically, we're interested in low power consumption. Reason two is instruction-level parallelism: individual processors nowadays are so fast that we're reaching a point where instructions, which are low-level commands to the CPU, can no longer be divided and organized in ways that optimally utilize the processor. Reason three is the memory wall: the growing gap between processor and memory speeds in effect pushes cache sizes to be larger in order to mask how slow the memory is.
Now there are three main types of solutions that use parallelism that we hope you can get an appreciation for: data parallelism, task parallelism, and pipeline parallelism. Data parallelism allows us to take the input data we want to work on, split it into small pieces, and give each piece to its own compute device to work on. Task parallelism takes a complex problem, divides it into smaller sub-problems or tasks, and gives each task its own compute device. Pipeline parallelism involves doing different kinds of work on a single piece of data; the pipeline is considered full when we have enough individual pieces of data that while one piece is being worked on by one stage, other pieces are waiting for or undergoing the other stages. More on this later.

Here, we're given an example of where data parallelism is used. In the code we see, we're doing a vector multiplication of two vectors, both with N elements, so this requires N multiplies. Assuming we also have N compute devices, we could give each device its own pair of numbers to multiply, all at the same time. Another assumption we make is that the result of one multiply is not an input to another multiply; we call this data independence.

Here's an example of task parallelism. We begin with a complex problem, which is divided into two simplified sub-problems or tasks. Each of the two sub-problems is given to a separate CPU to work on. At the end, the solutions to the sub-problems are recombined into a final solution.

Now, to further explain the concept of pipeline parallelism, let's explore the laundry analogy. As shown in the figure, there are three tasks that can be performed on a given data stream. For the purposes of our analogy, let task one be the washing phase, task two the drying phase, and task three the folding phase. And of course, the data stream is the different sets of clothes, such as delicates, non-delicates, etc.
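As a rough sketch of the data-parallelism idea (written in Python rather than OpenCL C, with a hypothetical `vec_mult` name), each element-wise product depends only on its own pair of inputs, so every multiply can be handed to a separate worker, much like giving each OpenCL work-item its own pair of numbers:

```python
from concurrent.futures import ThreadPoolExecutor

def vec_mult(a, b):
    """Element-wise multiply of two N-element vectors.

    Each product a[i] * b[i] depends only on its own pair of inputs
    (data independence), so the N multiplies could all happen at the
    same time; here a thread pool stands in for N compute devices.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda pair: pair[0] * pair[1], zip(a, b)))

print(vec_mult([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # → [4.0, 10.0, 18.0]
```

In a real OpenCL kernel the loop disappears entirely: each work-item computes one product, indexed by its global ID.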
In this example, a full pipeline would be one where, say, the delicates are being washed (soon to be dried and folded), the non-delicates, already washed, are now in the dryer, and a different, much earlier set of clothes is now being folded. Now obviously, a processor has specific components dedicated to performing particular tasks for any given instruction or data stream. When we have things like multiple functions, tasks, or sections of code being run at the same time, we usually have to worry about the effects of data and resources being shared. For example, if we're trying to run two functions at the same time on one CPU, this is called resource sharing. If we're expecting the output of one function to be the input of another function, then we have data sharing. The two main ways we fix these issues are locks and barriers: a lock locks up a resource that a function is using until it's done with it, and a barrier blocks each function until all of them have reached the barrier. In summation, parallel computing allows us to optimally use compute resources, run programs faster, and save power.
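A minimal sketch of both mechanisms, using Python's `threading` module (the names `worker`, `counter`, and `N_WORKERS` are ours, not from the lecture): the lock serializes updates to a shared counter, and the barrier holds every thread until all updates are finished, so each thread then reads the same final value.

```python
import threading

N_WORKERS = 4
counter = 0
counter_lock = threading.Lock()         # lock: protects the shared counter
barrier = threading.Barrier(N_WORKERS)  # barrier: releases once all arrive
results = []
results_lock = threading.Lock()

def worker():
    global counter
    # Phase 1: several workers update one shared resource.
    for _ in range(1000):
        with counter_lock:  # only one worker may touch counter at a time
            counter += 1
    # Barrier: nobody proceeds until every worker has finished phase 1.
    barrier.wait()
    # Phase 2: safe to read the final value -- all updates are done.
    with results_lock:
        results.append(counter)

threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the interleaved `counter += 1` updates could be lost; without the barrier, a fast worker might read `counter` before the others finished writing.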