{"id":163,"date":"2011-03-21T11:27:57","date_gmt":"2011-03-21T01:27:57","guid":{"rendered":"http:\/\/www.computer-vision-software.com\/blog\/?p=163"},"modified":"2011-03-22T11:23:53","modified_gmt":"2011-03-22T01:23:53","slug":"fine-tuning-of-compiler-options-to-increase-application-performance","status":"publish","type":"post","link":"http:\/\/www.computer-vision-software.com\/blog\/2011\/03\/fine-tuning-of-compiler-options-to-increase-application-performance\/","title":{"rendered":"Fine tuning of compiler options to increase application performance"},"content":{"rendered":"<p>Performance is essential for video analytics applications: the algorithms are usually computationally heavy, and such systems are expected to work in near real time. On the one hand, performance can be increased by improving or changing the algorithms. This is the main route, since it can increase performance dramatically. On the other hand, a little more performance can be gained in a relatively simple way: by using a good compiler and tuning its options.\u00a0Let's see how this works in real programs.<\/p>\n<p><!--more--><\/p>\n<p><strong>For the first example<\/strong> I used the LAME encoder (<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/lame.sourceforge.net\/');\"  href=\"http:\/\/lame.sourceforge.net\/\">http:\/\/lame.sourceforge.net\/<\/a>). Why LAME? First, because it is open source, so I can recompile it with different compilers and options. Second, because its performance is simple to measure. 
Here, performance is the time required to re-encode an MP3 file.\u00a0Third, it gives stable, repeatable results, which makes it easier to see how different compiler options affect speed.<\/p>\n<p>Testing was performed under the Windows operating system on computers with the following CPUs:<\/p>\n<p>Intel Pentium 4 3 GHz<br \/>\nIntel Core 2 Duo 2.8 GHz<br \/>\nAMD Athlon2x4 (635) 2.9 GHz overclocked to 3.3 GHz<br \/>\nIntel Core i5 (2500) 3.3 GHz<\/p>\n<p>The code was compiled with Visual Studio 9 and GCC 4.5.1 (using MinGW).<\/p>\n<p>Encoding time was measured 10 times, and the average value is shown in the tables.<\/p>\n<p>As the 0.00% baseline I used safe options (-O3 -march=prescott -fomit-frame-pointer -mfpmath=sse) that work on most modern AMD and Intel CPUs. The -march=core2 option may emit SSSE3 instructions, so the resulting code may fail to run on AMD and Intel Pentium 4 family CPUs. In each table the percentage is the change in average time relative to the 0.00 % row, so negative values mean faster code.<\/p>\n<p><strong>Intel Pentium 4 3 GHz<\/strong><\/p>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>Compiler<\/td>\n<td>Compiler options<\/td>\n<td>Average Time<\/td>\n<td>%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>13.206153 sec<\/td>\n<td>-7.83 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>13.537400 sec<\/td>\n<td>-5.52 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>13.999892 sec<\/td>\n<td>-2.29 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>14.328020 sec<\/td>\n<td>0.00 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -fomit-frame-pointer -mfpmath=sse<\/td>\n<td>14.621770 sec<\/td>\n<td>2.05 %<\/td>\n<\/tr>\n<tr>\n<td>Visual_Studio_9<\/td>\n<td>\/GS- \/fp:fast \/O2<\/td>\n<td>14.646769 sec<\/td>\n<td>2.22 
%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<ol>\n<li>Optimizing for the Prescott architecture gives a 2% speed increase.<\/li>\n<li>-ffast-math gives 2% more.<\/li>\n<li>Profile-guided optimization gives a 5% speed increase.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><strong>Intel Core 2 Duo 2.8 GHz<\/strong><\/p>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>Compiler<\/td>\n<td>Compiler options<\/td>\n<td>Average Time<\/td>\n<td>%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>7.818235 sec<\/td>\n<td>-6.65 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>7.824039 sec<\/td>\n<td>-6.58 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>7.893243 sec<\/td>\n<td>-5.75 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>7.976644 sec<\/td>\n<td>-4.75 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>8.234858 sec<\/td>\n<td>-1.67 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>8.374867 sec<\/td>\n<td>0.00 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>8.415269 sec<\/td>\n<td>0.48 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>8.423270 sec<\/td>\n<td>0.58 %<\/td>\n<\/tr>\n<tr>\n<td>Visual_Studio_9<\/td>\n<td>\/GS- \/fp:fast \/O2<\/td>\n<td>8.814092 sec<\/td>\n<td>5.24 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -fomit-frame-pointer -mfpmath=sse<\/td>\n<td>9.224519 sec<\/td>\n<td>10.15 
%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<ol>\n<li>Optimizing for the core2 architecture gives a 10% speed increase.<\/li>\n<li>-ffast-math gives only a 1% increase.<\/li>\n<li>Profile-guided optimization gives a 6% increase.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><strong>Intel Core i5 (2500) 3.3 GHz<\/strong><\/p>\n<p>The GCC version used here has no dedicated -march option for Core i3, i5, and i7 CPUs;\u00a0-march=core2 can be used for them.<\/p>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>Compiler<\/td>\n<td>Compiler options<\/td>\n<td>Average Time<\/td>\n<td>%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>4.059390 sec<\/td>\n<td>-8.52 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>4.093767 sec<\/td>\n<td>-7.75 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>4.156268 sec<\/td>\n<td>-6.34 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>4.200015 sec<\/td>\n<td>-5.35 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>4.253143 sec<\/td>\n<td>-4.15 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>4.321892 sec<\/td>\n<td>-2.61 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>4.437519 sec<\/td>\n<td>0.00 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>4.468770 sec<\/td>\n<td>0.70 %<\/td>\n<\/tr>\n<tr>\n<td>Visual_Studio_9<\/td>\n<td>\/GS- \/fp:fast \/O2<\/td>\n<td>4.737522 sec<\/td>\n<td>6.76 
%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -fomit-frame-pointer -mfpmath=sse<\/td>\n<td>4.815647 sec<\/td>\n<td>8.52 %<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<ol>\n<li>Optimizing for a specific architecture gives about a 9% speed increase over the build without -march; -march=prescott and -march=core2 stay within about 1% of each other here.<\/li>\n<li>-ffast-math gives a 6% increase.<\/li>\n<li>Profile-guided optimization gives a 4% increase.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><strong>AMD Athlon2x4 (635) 2.9 GHz overclocked to 3.3 GHz<\/strong><\/p>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>Compiler<\/td>\n<td>Compiler options<\/td>\n<td>Average Time<\/td>\n<td>%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>6.078386 sec<\/td>\n<td>-6.14 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>6.170114 sec<\/td>\n<td>-4.73 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>6.308954 sec<\/td>\n<td>-2.58 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>6.388826 sec<\/td>\n<td>-1.35 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>6.476186 sec<\/td>\n<td>0.00 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>6.527979 sec<\/td>\n<td>0.80 %<\/td>\n<\/tr>\n<tr>\n<td>Visual_Studio_9<\/td>\n<td>\/GS- \/fp:fast \/O2<\/td>\n<td>6.942938 sec<\/td>\n<td>7.21 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>7.293316 sec<\/td>\n<td>12.62 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>7.372564 sec<\/td>\n<td>13.84 
%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.1<\/td>\n<td>-O3 -fomit-frame-pointer -mfpmath=sse<\/td>\n<td>7.661477 sec<\/td>\n<td>18.30 %<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<ol>\n<li>Optimizing for the amdfam10 architecture gives an 18% speed increase.<\/li>\n<li>-ffast-math gives 5%.<\/li>\n<li>Profile-guided optimization gives only 1%.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><strong>Overall results<\/strong><\/p>\n<p>Optimizing for a particular architecture, combined with profile-guided optimization, may give up to a 20% speed increase.<\/p>\n<p>&nbsp;<\/p>\n<p>As I already said,\u00a0LAME is a simple example. Let's see how these options affect a real video analytics application.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>For the second example<\/strong> I used a critical part of a real video analytics application (myAudience). It uses the Boost, OpenCV and\u00a0FFmpeg libraries and runs in several threads. Compared with the LAME encoder, measuring performance for this application was not so simple. Moreover, because measurements in a dynamic multithreaded environment are imprecise, the results were less repeatable. 
So I have prepared just one table, which shows the overall results as I understand them.<\/p>\n<p>The code was compiled with GCC 4.5.2 and GCC 4.1.2 on CentOS 5.5.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Intel Core i5 (2500) 3.3 GHz<\/strong><\/p>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>Compiler<\/td>\n<td>Compiler options<\/td>\n<td>Average Time<\/td>\n<td>%<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.2<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use<\/td>\n<td>19591.16<\/td>\n<td>-4.90 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.2<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -profile-use<\/td>\n<td>19873.74<\/td>\n<td>-3.53 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.2<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>20010.55<\/td>\n<td>-2.86 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.2<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>20410.36<\/td>\n<td>-2.09 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.2<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>20410.36<\/td>\n<td>-0.92 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.1.2<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>20532.91<\/td>\n<td>-0.33 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.5.2<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>20600.55<\/td>\n<td>0.00 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.1.2<\/td>\n<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>20816.26<\/td>\n<td>1.05 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.1.2<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math<\/td>\n<td>21962.44<\/td>\n<td>6.61 %<\/td>\n<\/tr>\n<tr>\n<td>GCC4.1.2<\/td>\n<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse<\/td>\n<td>22221.88<\/td>\n<td>7.87 %<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>What can we conclude from this? 
A few things:<\/p>\n<ol>\n<li>GCC 4.5.2 is a little faster than GCC 4.1.2, and it also allows the use of profile-guided optimization and the \u201camdfam10\u201d and \u201catom\u201d architecture options.<\/li>\n<li>Profile-guided optimization gives about a 4% speed increase.<\/li>\n<li>-ffast-math gives about a 2% speed increase.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>As you can see, tuning compiler options gives a real performance improvement. It is sometimes not huge, but it is almost free, so it should be kept in mind.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Performance is essential for video analytics applications: the algorithms are usually computationally heavy, and such systems are expected to work in near real time. On the one hand, performance can be increased by improving or changing the algorithms. This is the main route, since it can increase performance dramatically. On the other hand, a little more performance can be gained [&hellip;]<\/p>\n","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-163","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/posts\/163","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/comments?post=163"}],"version-history":[{"count":0,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/posts\/163\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.computer-vi
sion-software.com\/blog\/wp-json\/wp\/v2\/media?parent=163"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/categories?post=163"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/tags?post=163"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}