Optimization of a Parallel Sum of Absolute Difference Algorithm Using OpenCL

ABSTRACT

Optimization of a Parallel Sum of Absolute Difference Algorithm Using OpenCL

By Tae Kyun Kim

This thesis focuses on the development of a parallel sum of absolute differences (SAD) algorithm using Open Computing Language (OpenCL) on an Intel® Core™ multiprocessor platform. Three optimization techniques were examined for performance improvement. The SAD algorithm is used in stereo vision to create pixel-based disparity maps from two concurrent images captured by a pair of cameras positioned a fixed distance apart. The disparity maps can then be used to derive the depths of objects in the scene of interest. Depth detection via computer vision is an inexpensive way to support forward collision prevention and pedestrian detection in Advanced Driver Assistance Systems (ADAS).

In this thesis, we surveyed MATLAB's computer vision modeling literature, investigated the development platform provided by NXP Semiconductors for ADAS applications, implemented the sum of absolute differences algorithm using OpenCL, and analyzed the resulting performance under a number of optimization techniques. MATLAB provides fundamental top-level designs for proof of concept of computer-vision-based ADAS algorithms, but real-time performance is not guaranteed. State-of-the-art embedded systems support real-time heterogeneous computing, and we investigated how to run ADAS applications on an NXP embedded board. However, we could not develop programs for its core vector processors due to licensing restrictions. Instead, we implemented a parallel computational unit targeting the Intel® multiprocessor platform using OpenCL to generate the disparity maps of the SAD algorithm. OpenCL is an effective parallel programming paradigm that offers a flexible execution framework, letting users leverage specific execution resources such as vector processors and explicit data streaming in the memory hierarchy.
It also provides software profiling capabilities that facilitate performance analysis of different optimization methodologies. Three optimization techniques (i.e., loop unrolling, explicit data mapping to local shared memory, and vectorization of processing) were investigated to improve the performance of the parallel SAD algorithm. Loop unrolling reduces loop-control overhead, while the latter two techniques leverage specific features of the targeted platform. Vectorization produced the greatest gain, with a speedup of 9.8×; loop unrolling achieved 3.1×; and explicit mapping to shared local memory achieved only 0.7×, i.e., a slowdown. We verified that programmer-specified explicit data mapping can be detrimental to performance. Hence, we demonstrate the effects of both effective and ineffective optimization methods.
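To make the SAD computation described above concrete, the following is a minimal scalar sketch in plain C (not the thesis's OpenCL kernel). It assumes 8-bit grayscale images in row-major layout, the left image as the reference, a square matching window of radius r, and a horizontal disparity search; the function names and parameters are illustrative, not taken from the thesis.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Cost of matching the window centered at (x, y) in the left image L
 * against the window shifted left by disparity d in the right image R.
 * Window size is (2r+1) x (2r+1); caller must keep indices in bounds. */
static int sad_window(const uint8_t *L, const uint8_t *R,
                      int width, int x, int y, int d, int r)
{
    int sum = 0;
    for (int dy = -r; dy <= r; dy++)
        for (int dx = -r; dx <= r; dx++) {
            int a = L[(y + dy) * width + (x + dx)];
            int b = R[(y + dy) * width + (x + dx - d)];
            sum += abs(a - b);
        }
    return sum;
}

/* For one reference pixel, return the disparity in [0, max_disp]
 * that minimizes the SAD cost; this per-pixel search is what the
 * parallel kernel distributes across work-items. */
int best_disparity(const uint8_t *L, const uint8_t *R,
                   int width, int x, int y, int max_disp, int r)
{
    int best_d = 0, best_cost = INT_MAX;
    for (int d = 0; d <= max_disp && d <= x - r; d++) {
        int cost = sad_window(L, R, width, x, y, d, r);
        if (cost < best_cost) { best_cost = cost; best_d = d; }
    }
    return best_d;
}
```

In the OpenCL implementation, each work-item would typically compute one output pixel of the disparity map, so the outer per-pixel loops disappear into the NDRange.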
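As a sketch of the loop-unrolling technique mentioned above, the fragment below shows one row of the SAD accumulation written naively and with a 4× unrolled loop. This is an illustration of the general technique, not the thesis's kernel code; it assumes the row length is a multiple of four for brevity. In OpenCL C the vectorization technique would express the same accumulation with vector types such as `uchar4`/`uchar16` and the `abs_diff` built-in.

```c
#include <stdint.h>
#include <stdlib.h>

/* Naive row accumulation: one loop-control check per element. */
int sad_row(const uint8_t *a, const uint8_t *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* 4x unrolled version: fewer branch/increment operations per element,
 * and four independent accumulators that the compiler can schedule
 * in parallel. Assumes n % 4 == 0. */
int sad_row_unrolled4(const uint8_t *a, const uint8_t *b, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += abs(a[i]     - b[i]);
        s1 += abs(a[i + 1] - b[i + 1]);
        s2 += abs(a[i + 2] - b[i + 2]);
        s3 += abs(a[i + 3] - b[i + 3]);
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same sum; the unrolled form only changes how the work is scheduled, which is why its benefit (like the 3.1× speedup reported here) depends on the target platform.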