ape: switch large filter to 16-bit data and add x86_64 optimisation