Change #264963
| Field | Value |
|---|---|
| Category | ffmpeg |
| Changed by | Jeongkeun Kim <variety0724@gmail.com> |
| Changed at | Sun 19 Apr 2026 21:27:55 |
| Repository | https://git.ffmpeg.org/ffmpeg.git |
| Project | ffmpeg |
| Branch | master |
| Revision | de18feb0f099154e6ccb6d3746881030d2f53fa1 |
Comments
avutil/aarch64: add pixelutils 32x32 SAD NEON implementation

This adds a NEON-optimized function that computes the 32x32 Sum of Absolute Differences (SAD) on AArch64. x86 already had SSE2/AVX2 implementations, but AArch64 had no equivalent.

The implementation mirrors the existing sad8 and sad16 NEON functions. It uses a 4-row unrolled loop in which UABAL and UABAL2 instructions are interleaved with the loads, and four 8x16-bit accumulators (8 lanes of 16 bits each) to cover the wider 32-byte rows.

Benchmarks on AWS Graviton3 (Neoverse V1, c7g.xlarge) using checkasm:

- sad_32x32_0: C 146.4 cycles -> NEON 98.1 cycles (1.49x speedup)
- sad_32x32_1: C 141.4 cycles -> NEON 98.9 cycles (1.43x speedup)
- sad_32x32_2: C 140.7 cycles -> NEON 95.0 cycles (1.48x speedup)

Signed-off-by: Jeongkeun Kim <variety0724@gmail.com>
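The following is a minimal C sketch of the accumulator layout described in the commit. It is not the committed assembly. The function names `sad_32x32_ref`, `sad_32x32_neon_sketch` and `sad_32x32` are hypothetical. Only the argument order `(src1, stride1, src2, stride2)` follows libavutil's `av_pixelutils_sad_fn`. The AArch64 path uses the ACLE intrinsics `vabal_u8` and `vabal_high_u8`, which correspond to UABAL and UABAL2. Each 32-byte row is split across four `uint16x8_t` accumulators, so any single lane receives only one absolute difference per row, at most 32 * 255 = 8160 over the block. This is why 16-bit lanes cannot overflow. For simplicity, the sketch processes one row per iteration instead of unrolling four rows.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: sum of |src1 - src2| over a 32x32 block of bytes. */
static int sad_32x32_ref(const uint8_t *src1, ptrdiff_t stride1,
                         const uint8_t *src2, ptrdiff_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 32; y++) {
        for (int x = 0; x < 32; x++)
            sum += abs(src1[x] - src2[x]);
        src1 += stride1;
        src2 += stride2;
    }
    return sum;
}

#if defined(__aarch64__) && defined(__ARM_NEON)
/* Hypothetical intrinsics sketch: four 8-lane 16-bit accumulators per
 * 32-byte row. vabal_u8 ~ UABAL (low 8 bytes), vabal_high_u8 ~ UABAL2
 * (high 8 bytes). One row per iteration, no 4-row unrolling. */
static int sad_32x32_neon_sketch(const uint8_t *src1, ptrdiff_t stride1,
                                 const uint8_t *src2, ptrdiff_t stride2)
{
    uint16x8_t acc0 = vdupq_n_u16(0), acc1 = vdupq_n_u16(0);
    uint16x8_t acc2 = vdupq_n_u16(0), acc3 = vdupq_n_u16(0);

    for (int y = 0; y < 32; y++) {
        uint8x16_t a0 = vld1q_u8(src1), a1 = vld1q_u8(src1 + 16);
        uint8x16_t b0 = vld1q_u8(src2), b1 = vld1q_u8(src2 + 16);
        acc0 = vabal_u8(acc0, vget_low_u8(a0), vget_low_u8(b0)); /* bytes 0-7   */
        acc1 = vabal_high_u8(acc1, a0, b0);                       /* bytes 8-15  */
        acc2 = vabal_u8(acc2, vget_low_u8(a1), vget_low_u8(b1)); /* bytes 16-23 */
        acc3 = vabal_high_u8(acc3, a1, b1);                       /* bytes 24-31 */
        src1 += stride1;
        src2 += stride2;
    }
    /* Each lane holds at most 8160, so the sum of all four is at most
     * 32640 and still fits in 16 bits. vaddlvq_u16 then widens to 32 bits
     * while reducing the 8 lanes (at most 261120). */
    uint16x8_t sum = vaddq_u16(vaddq_u16(acc0, acc1), vaddq_u16(acc2, acc3));
    return (int)vaddlvq_u16(sum);
}
#endif

/* Hypothetical dispatcher: NEON sketch on AArch64, scalar elsewhere. */
static int sad_32x32(const uint8_t *src1, ptrdiff_t stride1,
                     const uint8_t *src2, ptrdiff_t stride2)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    return sad_32x32_neon_sketch(src1, stride1, src2, stride2);
#else
    return sad_32x32_ref(src1, stride1, src2, stride2);
#endif
}
```

On every platform, the scalar reference serves as the correctness oracle, much like checkasm compares the C and NEON versions. The worst case (all 255 against all 0) gives 32 * 32 * 255 = 261120, which comfortably fits in an `int`.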
Changed files
- libavutil/aarch64/pixelutils.h
- libavutil/aarch64/pixelutils_neon.S