Optimization Tips (for libavcodec):
===================================

What to optimize:
-----------------
If you plan to do non-x86 architecture specific optimizations (SIMD normally),
then take a look in the x86/ directory, as most important functions are
already optimized for MMX.

If you want to do x86 optimizations then you can either try to fine-tune the
stuff in the x86 directory or find some other functions in the C source to
optimize, but there aren't many left.


Understanding these overoptimized functions:
--------------------------------------------
As many functions tend to be a bit difficult to understand because
of optimizations, it can be hard to optimize them further, or write
architecture-specific versions. It is recommended to look at older
revisions of the interesting files (web frontends for the various Libav
branches are listed at http://libav.org/download.html).
Alternatively, look into the other architecture-specific versions in
the x86/, ppc/, alpha/ subdirectories. Even if you don't exactly
comprehend the instructions, it could help understanding the functions
and how they can be optimized.

NOTE: If you still don't understand some function, ask at our mailing list!!!
(https://lists.libav.org/mailman/listinfo/libav-devel)


When is an optimization justified?
----------------------------------
Normally, clean and simple optimizations for widely used codecs are
justified even if they only achieve an overall speedup of 0.1%. These
speedups accumulate and can make a big difference after a while. Also, if
none of the following factors get worse due to an optimization -- speed,
binary code size, source size, source readability -- and at least one
factor improves, then an optimization is always a good idea even if the
overall gain is less than 0.1%.
For obscure codecs that are not often
used, the goal is more toward keeping the code clean, small, and
readable instead of making it 1% faster.


WTF is that function good for ....:
-----------------------------------
The primary purpose of this list is to avoid wasting time optimizing functions
which are rarely used.

put(_no_rnd)_pixels{,_x2,_y2,_xy2}
    Used in motion compensation (en/decoding).

avg_pixels{,_x2,_y2,_xy2}
    Used in motion compensation of B-frames.
    These are less important than the put*pixels functions.

avg_no_rnd_pixels*
    unused

pix_abs16x16{,_x2,_y2,_xy2}
    Used in motion estimation (encoding) with SAD.

pix_abs8x8{,_x2,_y2,_xy2}
    Used in motion estimation (encoding) with SAD of MPEG-4 4MV only.
    These are less important than the pix_abs16x16* functions.

put_mspel8_mc* / wmv2_mspel8*
    Used only in WMV2.
    It is not recommended that you waste your time with these, as WMV2
    is an ugly and relatively useless codec.

mpeg4_qpel* / *qpel_mc*
    Used in MPEG-4 qpel motion compensation (encoding & decoding).
    The qpel8 functions are used only for 4mv,
    the avg_* functions are used only for B-frames.
    Optimizing them should have a significant impact on qpel
    encoding & decoding.

qpel{8,16}_mc??_old_c / *pixels{8,16}_l4
    Just used to work around a bug in an old libavcodec encoder version.
    Don't optimize them.

tpel_mc_func {put,avg}_tpel_pixels_tab
    Used only for SVQ3, so only optimize them if you need fast SVQ3 decoding.

add_bytes/diff_bytes
    For huffyuv only, optimize if you want a faster ffhuffyuv codec.

get_pixels / diff_pixels
    Used for encoding, easy.

clear_blocks
    easiest to optimize

gmc
    Used for MPEG-4 gmc.
    Optimizing this should have a significant effect on the gmc decoding
    speed.

gmc1
    Used for chroma blocks in MPEG-4 gmc with 1 warp point
    (there are 4 luma & 2 chroma blocks per macroblock, so
    only 1/3 of the gmc blocks use this, the other 2/3
    use the normal put_pixel* code, but only if there is
    just 1 warp point).
    Note: DivX5 gmc always uses just 1 warp point.

pix_sum
    Used for encoding.

hadamard8_diff / sse / sad == pix_norm1 / dct_sad / quant_psnr / rd / bit
    Specific compare functions used in encoding, it depends upon the
    command line switches which of these are used.
    Don't waste your time with dct_sad & quant_psnr, they aren't
    really useful.

put_pixels_clamped / add_pixels_clamped
    Used for en/decoding in the IDCT, easy.
    Note, some optimized IDCTs have the add/put clamped code included and
    then put_pixels_clamped / add_pixels_clamped will be unused.

idct/fdct
    idct (encoding & decoding)
    fdct (encoding)
    difficult to optimize

dct_quantize_trellis
    Used for encoding with trellis quantization.
    difficult to optimize

dct_quantize
    Used for encoding.

dct_unquantize_mpeg1
    Used in MPEG-1 en/decoding.

dct_unquantize_mpeg2
    Used in MPEG-2 en/decoding.

dct_unquantize_h263
    Used in MPEG-4/H.263 en/decoding.

FIXME remaining functions?
BTW, most of these functions are in dsputil.c/.h, some are in mpegvideo.c/.h.



Alignment:
----------
Some instructions on some architectures have strict alignment restrictions,
for example most SSE/SSE2 instructions on x86.
The minimum guaranteed alignment is written in the .h files, for example:
    void (*put_pixels_clamped)(const DCTELEM *block/*align 16*/, UINT8 *pixels/*align 8*/, int line_size);


General Tips:
-------------
Use asm loops like:
__asm__(
    "1: ....
    ...
    "jump_instruction ....
Do not use C loops:
do{
    __asm__(
        ...
}while()

For x86, mark registers that are clobbered in your asm.
This means both
general x86 registers (e.g. eax) as well as XMM registers. This last one is
particularly important on Win64, where xmm6-15 are callee-save, and not
restoring their contents leads to undefined results. In external asm (e.g.
yasm), you do this by using:
cglobal function_name, num_args, num_regs, num_xmm_regs
In inline asm, you specify clobbered registers at the end of your asm:
__asm__(".." ::: "%eax").
If gcc is not set to support sse (-msse) it will not accept xmm registers
in the clobber list. For that we use two macros to declare the clobbers.
XMM_CLOBBERS should be used when there are other clobbers, for example:
__asm__(".." ::: XMM_CLOBBERS("xmm0",) "eax");
and XMM_CLOBBERS_ONLY should be used when the only clobbers are xmm registers:
__asm__(".." :: XMM_CLOBBERS_ONLY("xmm0"));

Do not expect a compiler to maintain values in your registers between separate
(inline) asm code blocks. It is not required to. For example, this is bad:
__asm__("movdqa %0, %%xmm7" : src);
/* do something */
__asm__("movdqa %%xmm7, %1" : dst);
- first of all, you're assuming that the compiler will not use xmm7 in
  between the two asm blocks. It probably won't when you test it, but it's
  a poor assumption that will break at some point for some --cpu compiler flag
- secondly, you didn't mark xmm7 as clobbered. If you did, the compiler would
  have restored the original value of xmm7 after the first asm block, thus
  rendering the combination of the two blocks of code invalid
Code that depends on data in registers being untouched should be written as
a single __asm__() statement. Ideally, a single function contains only one
__asm__() block.

Use external asm (nasm/yasm) or inline asm (__asm__()), do not use intrinsics.
The latter requires a good optimizing compiler which gcc is not.

Inline asm vs. external asm
---------------------------
Both inline asm (__asm__("..") in a .c file, handled by a compiler such as gcc)
and external asm (.s or .asm files, handled by an assembler such as yasm/nasm)
are accepted in Libav. Which one to use differs per specific case.

- if your code is intended to be inlined in a C function, inline asm is always
  better, because external asm cannot be inlined
- if your code calls external functions, yasm is always better
- if your code takes huge and complex structs as function arguments (e.g.
  MpegEncContext; note that this is not ideal and is discouraged if there
  are alternatives), then inline asm is always better, because predicting
  member offsets in complex structs is almost impossible. It's safest to let
  the compiler take care of that
- in many cases, both can be used and it just depends on the preference of the
  person writing the asm. For new asm, the choice is up to you. For existing
  asm, you'll likely want to maintain whatever form it is currently in unless
  there is a good reason to change it.
- if, for some reason, you believe that a particular chunk of existing external
  asm could be improved upon further if written in inline asm (or the other
  way around), then please make the move from external asm <-> inline asm a
  separate patch before your patches that actually improve the asm.
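
As an illustration of the inline asm advice above, here is a minimal sketch,
assuming gcc or clang on an x86 target with SSE2 enabled. The load and the
store sit in one __asm__() statement, and both xmm7 and memory are declared
as clobbers. The name copy16_sketch is hypothetical and not part of Libav;
movdqu is used instead of movdqa only so that the sketch does not depend on
16-byte alignment, and the memcpy() fallback is an assumption for builds
without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of Libav: copy 16 bytes through xmm7. */
static void copy16_sketch(uint8_t *dst, const uint8_t *src)
{
    assert(dst && src);
#if defined(__GNUC__) && defined(__SSE2__)
    /* Load and store share one asm statement, so the compiler cannot
     * touch xmm7 in between. xmm7 and "memory" are listed as clobbers. */
    __asm__ volatile(
        "movdqu (%1), %%xmm7 \n\t"
        "movdqu %%xmm7, (%0) \n\t"
        :
        : "r"(dst), "r"(src)
        : "xmm7", "memory");
#else
    /* Assumed fallback for targets without SSE2. */
    memcpy(dst, src, 16);
#endif
}
```

The __SSE2__ guard stands in here for the XMM_CLOBBERS macros mentioned
earlier; in-tree code would presumably use those macros instead.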


Links:
======
http://www.aggregate.org/MAGIC/

x86-specific:
-------------
http://developer.intel.com/design/pentium4/manuals/248966.htm

The IA-32 Intel Architecture Software Developer's Manual, Volume 2:
Instruction Set Reference
http://developer.intel.com/design/pentium4/manuals/245471.htm

http://www.agner.org/assem/

AMD Athlon Processor x86 Code Optimization Guide:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf


ARM-specific:
-------------
ARM Architecture Reference Manual (up to ARMv5TE):
http://www.arm.com/community/university/eulaarmarm.html

Procedure Call Standard for the ARM Architecture:
http://www.arm.com/pdfs/aapcs.pdf

Optimization guide for ARM9E (used in Nokia 770 Internet Tablet):
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0240b/DDI0240A.pdf
Optimization guide for ARM11 (used in Nokia N800 Internet Tablet):
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0211j/DDI0211J_arm1136_r1p5_trm.pdf
Optimization guide for Intel XScale (used in Sharp Zaurus PDA):
http://download.intel.com/design/intelxscale/27347302.pdf
Intel Wireless MMX2 Coprocessor: Programmers Reference Manual
http://download.intel.com/design/intelxscale/31451001.pdf

PowerPC-specific:
-----------------
PowerPC32/AltiVec PEM:
www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPEM.pdf

PowerPC32/AltiVec PIM:
www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

CELL/SPU:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E/$file/Language_Extensions_for_CBEA_2.4.pdf
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F/$file/CBE_Handbook_v1.1_24APR2007_pub.pdf

SPARC-specific:
---------------
SPARC Joint Programming Specification (JPS1): Commonality
http://www.fujitsu.com/downloads/PRMPWR/JPS1-R1.0.4-Common-pub.pdf

UltraSPARC III Processor User's Manual (contains instruction timings)
http://www.sun.com/processors/manuals/USIIIv2.pdf

VIS Whitepaper (contains optimization guidelines)
http://www.sun.com/processors/vis/download/vis/vis_whitepaper.pdf

GCC asm links:
--------------
official doc but quite ugly
http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

a bit old (note "+" is valid for input-output, even though the next disagrees)
http://www.cs.virginia.edu/~clc5q/gcc-inline-asm.pdf