Acceleration of H.266 Encoding Using OpenCL and Vectorization with Block Size Variation

Date

2025-06

Authors

Michael Girma

Publisher

Addis Ababa University

Abstract

Versatile Video Coding (VVC, H.266) achieves approximately a 50% reduction in bitrate compared to its predecessor, H.265/HEVC. However, this improvement in compression efficiency comes with a significant increase in computational complexity, which makes real-time encoding on general-purpose processors difficult. Most existing H.266/VVC implementations rely heavily on CPU-only processing or on vendor-specific GPU solutions such as CUDA, which limits portability and cross-platform compatibility. Moreover, these approaches often fail to fully utilize modern heterogeneous CPU-GPU architectures, leaving substantial performance potential unexploited. This work proposes an OpenCL-based H.266 encoding solution aimed at delivering high performance, broad cross-platform support, and efficient hardware utilization. Key encoding modules, including block partitioning, prediction, transform and quantization, loop filtering, and entropy coding, are implemented as OpenCL kernels to leverage task-level parallelism across both CPUs and GPUs. Additionally, AVX and SSE vectorization techniques are applied on the CPU side to enhance per-core throughput, particularly in compute-intensive operations such as transform and quantization. Experimental results across various platforms demonstrate significant performance improvements. On an NVIDIA V100 GPU, the OpenCL-accelerated encoder achieves speedups of up to 7500× compared to a sequential implementation running on an Intel Xeon E5-2698 v4, with peak efficiency observed at a block size of 512×512. Tests conducted on an Intel UHD 620 GPU and an Intel i5-8265U CPU reveal speedups ranging from 15.5× to 370×, depending on the block size. The findings suggest that medium block sizes (64×64 to 256×256) strike the best balance between computational efficiency and workload distribution. AVX provides only modest gains over SSE, suggesting that the primary performance bottleneck lies in memory access speed rather than computational power.
Overall, the proposed OpenCL-based implementation significantly accelerates H.266 encoding while maintaining high compression quality.
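
To make the block-size trade-off described above more concrete, the following is a minimal illustrative sketch, not the thesis's implementation. It tiles a frame into square blocks and counts the independent work items each block size would yield. The 1920×1080 frame size and the function names `partition_frame` and `tile_count` are assumptions made for illustration only. Mapping one tile to one unit of parallel work is likewise a simplification of how an OpenCL kernel might be dispatched.

```python
def tile_count(width, height, block):
    """Number of block x block tiles needed to cover the frame (ceiling division)."""
    return -(-width // block) * -(-height // block)


def partition_frame(width, height, block):
    """Return (x, y, w, h) tiles covering a width x height frame.

    Tiles on the right and bottom edges are clipped, so every pixel
    belongs to exactly one tile.
    """
    tiles = []
    for y in range(0, height, block):
        for x in range(0, width, block):
            tiles.append((x, y, min(block, width - x), min(block, height - y)))
    return tiles


if __name__ == "__main__":
    # Hypothetical 1080p frame: smaller blocks give more, lighter work items;
    # larger blocks give fewer, heavier ones.
    for block in (16, 64, 256, 512):
        print(f"block {block:>3}: {tile_count(1920, 1080, block)} tiles")
```

Under these assumptions, a small block size produces thousands of tiny work items, while 512×512 produces only a handful. This is one way to picture why the balance between per-item cost and workload distribution depends on block size.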

Keywords

H.266/VVC, OpenCL, GPU Acceleration, CPU Optimization, Video Encoding
