


OpenCL - Introduction - 2013



What is OpenCL?

It is "The open standard for parallel programming of heterogeneous systems" according to Khronos Group.

Here is another description, from Wikipedia's article on OpenCL.
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors.
OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. Its architecture shares a range of computational interfaces with two competitors, NVIDIA's Compute Unified Device Architecture and Microsoft's DirectCompute.

OpenCL gives any application access to the Graphics Processing Unit for non-graphical computing. Thus, OpenCL extends the power of the Graphics Processing Unit beyond graphics (General-purpose computing on graphics processing units).
OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. OpenCL is managed by the non-profit technology consortium Khronos Group.



OpenCL is being created by the Khronos Group with the participation of many industry-leading companies and institutions including 3DLABS, Activision Blizzard, AMD, Apple, ARM, Broadcom, Codeplay, Electronic Arts, Ericsson, Freescale, Fujitsu, GE, Graphic Remedy, HI, IBM, Intel, Imagination Technologies, Los Alamos National Laboratory, Motorola, Movidius, Nokia, NVIDIA, Petapath, QNX, Qualcomm, RapidMind, Samsung, Seaweed, S3, ST Microelectronics, Takumi, Texas Instruments, Toshiba and Vivante.



OpenCL 1.1

OpenCL 1.1 includes significant new functionality including:

  • Host-thread safety, enabling OpenCL commands to be enqueued from multiple host threads.
  • Sub-buffer objects to distribute regions of a buffer across multiple OpenCL devices.
  • User events to enable enqueued OpenCL commands to wait on external events.
  • Event callbacks that can be used to enqueue new OpenCL commands based on event state changes in a non-blocking manner.
  • 3-component vector data types.
  • Global work-offset, which enables kernels to operate on different portions of the NDRange.
  • Memory object destructor callback.
  • Read, write and copy a 1D, 2D or 3D rectangular region of a buffer object.
  • Mirrored repeat addressing mode and additional image formats.
  • New OpenCL C built-in functions such as integer clamp, shuffle and asynchronous strided copies.
  • Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL event objects to OpenGL fence sync objects.
  • Optional features in OpenCL 1.0 have been brought into core OpenCL 1.1, including writes to pointers to bytes or shorts from a kernel, and atomic operations on 32-bit integers in local or global memory.



GPU Design Tenets
  1. Many simple compute units rather than fewer complex, powerful ones.
  2. Simple control logic in exchange for more compute.
  3. An explicitly parallel programming model.
  4. Optimize for throughput, not latency.



CUDA

Typical GPU program
  1. CPU allocates storage on GPU (cudaMalloc).
  2. CPU copies input data CPU -> GPU (cudaMemcpy).
  3. CPU launches kernel(s) on the GPU to process the data (kernel launch).
  4. CPU copies the results back to CPU from GPU (cudaMemcpy).

In general, a GPU is good at the following:

  • Launching a large number of threads efficiently.
  • Running a large number of threads in parallel.

Figure: a typical CUDA program (cuda_program.png)


CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through variants of industry standard programming languages.
Programmers use 'C for CUDA' (C with NVIDIA extensions and certain restrictions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. The CUDA architecture shares a range of computational interfaces with two competitors: the Khronos Group's OpenCL and Microsoft's DirectCompute. Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, and MATLAB.
CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs become accessible for computation like CPUs. Unlike CPUs however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very fast. This approach of solving general purpose problems on GPUs is known as GPGPU.
In the computer game industry, in addition to graphics rendering, GPUs are used in game physics calculations (physical effects like debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more. An example of this is the BOINC distributed computing client.
CUDA provides both a low level API and a higher level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0, which supersedes the beta released February 14, 2008.
CUDA works with all NVIDIA GPUs from the G8X series onwards, including GeForce, Quadro and the Tesla line. NVIDIA states that programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards, due to binary compatibility. (from Wikipedia)



Figure: CUDA processing flow (image source: Wikipedia)