Fine tuning of compiler options to increase application performance

Posted on : 21-03-2011 | By : Alexander Permyakov | In : Uncategorized

Performance is essential for video analytics applications: the algorithms are usually computationally heavy, and such systems are expected to work in near real time. On the one hand, performance can be raised by improving or replacing algorithms. This is the main route, since it can increase performance dramatically. On the other hand, a little more can be gained in a relatively simple way: by using a good compiler and tuning its options. Let's see how this works on real programs.

For the first example I used the LAME encoder (http://lame.sourceforge.net/). Why LAME? First, it is open source, so I can recompile it with different compilers and options. Second, its performance is simple to measure: it is the time required to re-encode an MP3 file. Third, it gives well-reproducible results, which makes it easier to understand how different compile options affect speed.

The tests were run under Windows on computers with the following CPUs:

Intel Pentium 4 3GHz
Intel Core 2 Duo 2.8 GHz
AMD Athlon II X4 (635) 2.9 GHz overclocked to 3.3 GHz
Intel Core i5 (2500) 3.3 GHz

Compilation was done with Visual Studio 9 and GCC 4.5.1 (using MinGW).

Encoding time was measured 10 times, and the average value is shown in the tables.
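The measurement procedure can be sketched roughly as follows. This is only an illustration, not the script used to produce the tables: the helper name `average_runtime` and the example command lines are assumptions.

```python
import subprocess
import sys
import time


def average_runtime(cmd, runs=10):
    """Run `cmd` `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True aborts the measurement if the program exits with an error
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    # Placeholder workload so the sketch runs anywhere; for the encoder test it
    # would be something like ["lame", "input.mp3", "output.mp3"]
    # (hypothetical file names).
    cmd = [sys.executable, "-c", "sum(range(100000))"]
    print(f"average over 10 runs: {average_runtime(cmd):.6f} sec")
```

Wall-clock time includes process start-up, so for very short runs a longer input file gives steadier numbers.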

As the 0.00% baseline I used safe options (-O3 -march=prescott -fomit-frame-pointer -mfpmath=sse) that work on most modern AMD and Intel CPUs. The -march=core2 option may generate SSSE3 instructions, so the resulting code may fail to run on AMD CPUs and on the Intel Pentium 4 family.
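Before shipping a binary built with a more aggressive -march value, it can help to check whether the target CPU actually reports the required instruction set. The sketch below assumes a Linux machine, where the feature flags are listed in /proc/cpuinfo; the function names are made up for this example, and on Windows a different source of CPU information would be needed.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports(cpuinfo_text, feature):
    """True if `feature` (for example "ssse3") is among the reported flags."""
    return feature in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:  # Linux only (assumption)
            text = f.read()
    except OSError:
        text = ""
    print("ssse3 available:", supports(text, "ssse3"))
```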

Intel Pentium 4 3GHz

Compiler | Compiler options | Average time | Change vs. baseline
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 13.206153 sec | -7.83 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -fprofile-use | 13.537400 sec | -5.52 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 13.999892 sec | -2.29 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 14.328020 sec | 0.00 %
GCC 4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 14.621770 sec | 2.05 %
Visual Studio 9 | /GS- /fp:fast /O2 | 14.646769 sec | 2.22 %
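The percentage column is the relative difference from the baseline row, so negative values mean faster code. A small sketch of that calculation, using two timings from the Pentium 4 table (the function name is an assumption made for this example):

```python
def relative_change(time_sec, baseline_sec):
    """Percent difference from the baseline; negative means faster."""
    return (time_sec - baseline_sec) / baseline_sec * 100.0


baseline = 14.328020  # -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse
fastest = 13.206153   # same options plus -ffast-math and profile-guided optimization

print(f"{relative_change(fastest, baseline):.2f} %")  # -7.83 %
```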

 

  1. Optimizing for the prescott architecture gives a 2% speed increase.
  2. -ffast-math gives 2% more.
  3. Profile-guided optimization gives a 5% speed increase.
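Profile-guided optimization with GCC is a two-pass build: compile with -fprofile-generate, run the instrumented program on a representative workload, then recompile with -fprofile-use. The sketch below only assembles the three command lines. The source and input file names are hypothetical, and a real project such as LAME would pass these flags through its own build system.

```python
BASE_FLAGS = ("-O3", "-march=prescott", "-fomit-frame-pointer", "-mfpmath=sse")


def pgo_steps(sources, output, training_cmd, flags=BASE_FLAGS):
    """Return the three commands of a GCC profile-guided build, in order."""
    # Pass 1: instrumented build that records profile data when it runs
    instrumented = ["gcc", *flags, "-fprofile-generate", *sources, "-o", output]
    # Pass 2: rebuild using the profile data collected by the training run
    optimized = ["gcc", *flags, "-fprofile-use", *sources, "-o", output]
    return [instrumented, list(training_cmd), optimized]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    for step in pgo_steps(["encoder.c"], "encoder", ["./encoder", "sample.wav"]):
        print(" ".join(step))
```

The training run should resemble real usage, because the compiler optimizes for the paths that the profile marks as hot.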

 

Intel Core 2 Duo 2.8 GHz

Compiler | Compiler options | Average time | Change vs. baseline
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 7.818235 sec | -6.65 %
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -fprofile-use | 7.824039 sec | -6.58 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 7.893243 sec | -5.75 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -fprofile-use | 7.976644 sec | -4.75 %
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 8.234858 sec | -1.67 %
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 8.374867 sec | 0.00 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 8.415269 sec | 0.48 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 8.423270 sec | 0.58 %
Visual Studio 9 | /GS- /fp:fast /O2 | 8.814092 sec | 5.24 %
GCC 4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 9.224519 sec | 10.15 %

 

  1. Optimizing for the core2 architecture gives a 10% speed increase.
  2. -ffast-math gives only a 1% increase.
  3. Profile-guided optimization gives a 6% increase.
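-ffast-math trades strict IEEE floating-point behaviour for speed. Among other things, it allows the compiler to reassociate floating-point operations, which can change results in the last bits. The snippet below illustrates why reordering is not value-preserving. It is written in Python rather than compiled C, so it shows the arithmetic effect only, not the compiler's behaviour.

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c  # evaluation order as written in the source
reordered = a + (b + c)      # an order a compiler may choose under -ffast-math

print(left_to_right == reordered)      # False
print(abs(left_to_right - reordered))  # tiny, but not zero
```

For an audio encoder such differences are usually harmless, but code that depends on exact floating-point results should be tested before this option is enabled.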

 

Intel Core i5 (2500) 3.3 GHz

GCC 4.5.1 has no dedicated -march option for Core i3/i5/i7 CPUs; -march=core2 can be used for them.

Compiler | Compiler options | Average time | Change vs. baseline
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 4.059390 sec | -8.52 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 4.093767 sec | -7.75 %
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 4.156268 sec | -6.34 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 4.200015 sec | -5.35 %
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -fprofile-use | 4.253143 sec | -4.15 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -fprofile-use | 4.321892 sec | -2.61 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 4.437519 sec | 0.00 %
GCC 4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 4.468770 sec | 0.70 %
Visual Studio 9 | /GS- /fp:fast /O2 | 4.737522 sec | 6.76 %
GCC 4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 4.815647 sec | 8.52 %

 

  1. Optimizing for the core2 architecture gives a 9% speed increase.
  2. -ffast-math gives a 6% increase.
  3. Profile-guided optimization gives a 4% increase.

 

AMD Athlon II X4 (635) 2.9 GHz overclocked to 3.3 GHz

Compiler | Compiler options | Average time | Change vs. baseline
GCC 4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 6.078386 sec | -6.14 %
GCC 4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse -ffast-math | 6.170114 sec | -4.73 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 6.308954 sec | -2.58 %
GCC 4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse -fprofile-use | 6.388826 sec | -1.35 %
GCC 4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse | 6.476186 sec | 0.00 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -fprofile-use | 6.527979 sec | 0.80 %
Visual Studio 9 | /GS- /fp:fast /O2 | 6.942938 sec | 7.21 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 7.293316 sec | 12.62 %
GCC 4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 7.372564 sec | 13.84 %
GCC 4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 7.661477 sec | 18.30 %

 

  1. Optimizing for the amdfam10 architecture gives an 18% speed increase.
  2. -ffast-math gives 5%.
  3. Profile-guided optimization gives only 1%.

 

Overall results

Optimizing for a particular architecture, combined with profile-guided optimization, may give up to a 20% speed increase.
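The option sets compared in the tables form a small matrix: one -march value combined with -ffast-math and profile-guided optimization switched on or off. A sketch of how such a matrix could be enumerated for a batch of test builds follows; the helper name and the way the flags are assembled are assumptions, not the procedure actually used here.

```python
from itertools import product

COMMON = ["-O3", "-fomit-frame-pointer", "-mfpmath=sse"]


def option_matrix(march_values, extras=("-ffast-math", "-fprofile-use")):
    """Yield one flag list per -march value and per on/off choice of each extra."""
    on_off = [(False, True)] * len(extras)
    for march, *switches in product(march_values, *on_off):
        flags = COMMON + [f"-march={march}"]
        flags += [extra for extra, on in zip(extras, switches) if on]
        yield flags


if __name__ == "__main__":
    for flags in option_matrix(["prescott", "core2"]):
        print("gcc " + " ".join(flags))
```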

 

As I already said, LAME is a simple example. Let's see how these options affect a real video analytics application.

 

For the second example I used the performance-critical part of a real video analytics application (myAudience). It uses the Boost, OpenCV and FFmpeg libraries and runs in several threads. Compared with the LAME encoder, measuring performance for this application was not as simple. Moreover, because measurements in a dynamic multithreaded environment are inaccurate, the results were less reproducible. So I have prepared just one table that shows the results in general, as I understand them.

Compilation was done with GCC 4.5.2 and GCC 4.1.2 on CentOS 5.5.
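When timings are noisy, it helps to look at the spread of the samples as well as the average, and to treat differences smaller than the noise with caution. A minimal sketch using Python's statistics module is shown below; the two-standard-deviation threshold is a rule of thumb chosen for this example, and the sample values are invented for illustration, not measurements from this application.

```python
import statistics


def summarize(samples):
    """Return mean, median and sample standard deviation of timing samples."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def is_significant(time_a, time_b, stdev):
    """Crude check: accept a difference only if it exceeds twice the noise."""
    return abs(time_a - time_b) > 2 * stdev


if __name__ == "__main__":
    samples = [10.0, 12.0, 14.0]  # invented values, for illustration only
    stats = summarize(samples)
    print(stats)
    print(is_significant(12.0, 11.5, stats["stdev"]))
```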

 

Intel Core i5 (2500) 3.3 GHz

Compiler | Compiler options | Average time | Change vs. baseline
GCC 4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math -fprofile-use | 19591.16 | -4.90 %
GCC 4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -fprofile-use | 19873.74 | -3.53 %
GCC 4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 20010.55 | -2.86 %
GCC 4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 20410.36 | -2.09 %
GCC 4.5.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 20410.36 | -0.92 %
GCC 4.1.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 20532.91 | -0.33 %
GCC 4.5.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 20600.55 | 0.00 %
GCC 4.1.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 20816.26 | 1.05 %
GCC 4.1.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 21962.44 | 6.61 %
GCC 4.1.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 22221.88 | 7.87 %

What can we conclude from this? A few things:

  1. GCC 4.5.2 is a little faster than GCC 4.1.2, and it also supports profile-guided optimization and the “amdfam10” and “atom” architecture options.
  2. Profile-guided optimization gives about a 4% speed increase.
  3. -ffast-math gives about a 2% speed increase.

 

As you can see, tuning compiler options yields a real performance improvement. It is not always huge, but it is almost free, so it is worth keeping in mind.

Comments (2)

I played around with these kinds of options a long time ago. Now, thanks to your post, I’m going to give it a new try.

Best regards 😉

Nice post. Some of those settings made a big difference, and it's good to see that newer versions of GCC are faster than VS2008 🙂
