Tuesday, September 9, 2014

Linktime optimization in GCC, part 3 - LibreOffice

After Firefox, I decided to look into LibreOffice performance with link time optimizations (LTO) and profile feedback (FDO). This post was in works since April, so my setup is not exactly bleeding edge - GCC 4.9.1 prerelease, LLVM 3.5 and LibreOffice checkout from April. Some real world issues (like a wedding) interrupted my work, however the data presented should be still relevant and interesting.



Update: I fixed the permissions for google drive, so you can see the charts.

Prerequisities

In addition to all dependencies described in Building LibreOffice on Linux: Tips and Tricks the following is needed to get LTO build (copied from Firefox report):

  1. Binutils with plugin enabled. This can be checked by
    $ ld --help | grep plugin
      -plugin PLUGIN              Load named plugin
      -plugin-opt ARG             Send arg to last-loaded plugin
    I personally use gold from GNU Binutils 2.24.51.20140405 but any reasonably recent binutils either with GNU ld or gold should do the job. One needs to however enable the plugin while configuring since it is off by default.
  2. GCC 4.7+ that is configured with plugin support (i.e. built with the plugin enabled LD). For sane memory use, GCC 4.9 is recommended, it requires about 5 times less RAM.

    To test if GCC runs with plugin support, try to link simple hello world application and do:
    $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
    If the output is non-empty you are having plugin set up correctly.
  3. Clang needs to be built with gold llvm plugin enabled, too. Follow the official instructions.
  4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with plugin argument.

    This is somewhat anoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ld, but it is not very easy to convince build script to use them. Adding to path a simple script calling gcc-* equivalent will lead to infinite recursion. I use the following hack:
    $ cat ar
    #! /bin/sh
    command=$1
    shift
    /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat nm
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat ranlib
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ranlib $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*
    In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/LLVMgold.so.

    Alternatively the plugin can be installed to the default plugin sear path (/usr/lib/bfd-plugins). This solution has some limitations however. One needs not-yet-released binutils to make this work with GCC and one can not easily install multiple plugins. I hope this situation will change in not so distant future.


    Forgetting these steps will give you undefined symbols as soon as you start to link static library build with -flto and GCC 4.9+ or -flto -fno-fat-lto-objects with earlier compilers. 

GCC 4.9.0 -Os -flto bug

GCC 4.9.0 will segfault building LibreOffice with -Os -flto. I fixed it by this patch that got only into GCC 4.9.1.

    LibreOffice configuration

    I used trunk checkout from Apr 28 2014. I had to use --without-doxygen --enable-python=no for builds to pass. I built with --disable-symbols to get builds faster, especially LLVM LTO needs a lot of extra memory and compile time for debug info.

    In some builds I enabled LTO by --enable-lto and merged libts by --enable-mergelibs.

    I had to disable some unit tests because they were failing in some configurations (the writer tests failed in LTO builds only, while the rest failed in non-LTO, too): CppunitTest_cppcanvas_emfplus, CppunitTest_dbaccess_firebird_test, CppunitTest_dbaccess_embeddeddb_performancetest, CppunitTest_services, CppunitTest_writerperfect_impress, CppunitTest_writerperfect_writer.

    Building with profile feedback

    It seems that never has tried it before. After some experimentation I got LibreOffice building and working with profile feedback. This is not completely trivial:
    1. First configure with profile collection enabled:
      CFLAGS="-O2 -fprofile-generate -fprofile-correction" CXXFLAGS="-O2 -fprofile-generate -fprofile-correction" LDFLAGS="-O2 -fprofile-generate -fprofile-correction" <srcdir>autogen.sh --enable-mergelibs --disable-symbols --without-doxygen --enable-python=no --enable-lto
    2. Unfortunately the tree fails to build, this is because dyamic linker fails with "cannot load any more object with static TLS". While profiling uses TLS, it uses global dynamic model one and I failed to figure out what introduces static TLS into the libraries. In my limited understanding, this problem ought to happen only when objects built without -fPIC are linked into dynamic library and the dynamic library is loaded by dlopen. I build on x86-64 and in this case i would get warnings about wrong relocations used and I am not getting these.

      Anyway, there is ugly workaround to preload the largest library. This will get dynamic linker to allocate enough of space for the tables before application starts:

      CPPUNITTRACE="LD_PRELOAD=<builddir>/instdir/program/libmergedlo.so" make
      I still had to disable couple more unit tests to let the build finish. Something to debug later ;)

      It is a good idea to have DISPLAY environment variable set to working X server. In this case more testing of interactive code is done. I use vncserver to get things trained.
    3. To train the application, I started LibreOffice, went to writer and quit. I assume that the unit testing has pretty good coverage and thus the application is reasonably trained once build finishes. It is still good to be sure that the functions executed at startup are triggered. It also may be good idea to do make check (with the LD_PRELOAD hack) for some additional testing if it was not failing with even more unit tests.

      In fact there is interesting problem with this - the code layout produced by LTO build is likely more optimized for unit testing than the actual application. It may make sense to start LibreOffice enough times so the normal startup prevails or avoid the unit testing and do some more extensive training like Firefox does.
    4. Archive profile files (this step is needed, since make clean will remove them):
      tar czvf `find . -name *.gcda` /tmp/profile.tgz
    5. Build with profile feedback:
      make clean
      sed s/profile-generate/profile-use/g -i config_host.mk
      tar xzvf /tmp/profile.tgz
      make

    Compile time


    LibreOffice can build in two modes - default and mergedlibs. In mergedlibs most of shared libraries are built into one library (libmerged) to reduce dynamic linking overhead. Of course mergedlibs is very interesting for LTO and thus I compiled in most cases LibreOffice in both setting and did most of testing only in mergedlib mode.  Since build takes about an hour, i did not complete the full matrix and chose only combinations I consider interesting.

    There seems to be no major surprises: LTO got significantly faster since GCC 4.8 times (by 32%). This is by disabling fat LTO files and thus not compiling everything twice. This time I did not enable LTO parallelizm (-flto=16), because it is bit hard to glue into the Makefile and as shown later, it is not causing that major difference. Clearly there is thus low hanging fruit to get LTO builds faster by getting the parallelism right. Good news is that even with this simple setup LTO is not significantly slower than normal compilation.

    GCC 4.7 can not build LibreOffice with LTO.

    LLVM at -O2 is about 32% faster than GCC 4.9. Sadly GCC has noticeably slowed down since GCC 4.7 (by 11%). This was not so visible for Firefox compilation and I believe it is caused, by large part, by more expensive initialization at startup for IRA (new register allocator introduced to 4.8) and by bazillions of the new AVX/SSE builtins introduced. GCC 4.8 was a busy release. I did some work on this problem for GCC 5. I will publish update on the build times once I get LibreOffice building reliably with that compiler.


    This graph shows system and user times. Clearly GCC consume more of system time - by requiring separate assembler and by making more pages dirty. Again cost of LTO is not really extreme. I did not investigate why GCC 4.9 mergedbuild got better than others. Seem bit odd.

    A good improvement to LTO builds would probably be to avoid the completely useless assembler stage at compile time - at the moment we produce LTO bytecode and dump it in ascii format for assembler to produce the slim LTO object. GCC has all the code to avoid this and go directly into object files that is used during the linking stage. All that needs to be done is to enable it for slim LTO compilation.

    Memory use during compilation

    GCC 4.9
    GCC 4.9 mergedlibs


    GCC 4.9 LTO

    GCC 4.9 LTO mergedlibs
    Here some extra memory (about 2GB) is needed, but it can be tweaked by reducing parallelism as in the case of Firefox builds. In my setup up to 180 parallel LTO optimization processes was run, something that is definitely not needed or optimal.

    Largest LibreOffice library is just a fraction of size of largest Firefox library, so memory usage do not change that significantly as I reported last time. Obviously the linking occupy time after 2900s and CPU utilization drops that can be tuned at Makefile side. While the LTO builds are not faster than normal builds, I think this milestone can be reached with little bit of extra tuning.

    LLVM

    LLVM LTO, mergedlibs

    As seen in the second graph, GCC 4.9 needs about 1000 seconds for final link stage at LTO. (Linking is easily visible by seeing drop in CPU usage by lost parallelism.) This is about 50% faster than LLVM's.

    LLVM use a lot less memory while compiling and also its compilation stage is faster. For GCC 5 I reduced startup cost of LTO compilation, so hopefully these differences will get smaller. 

    GCC 4.8 LTO, mergedlibs

    While GCC 4.9 got faster at linktime, it got slower at compile time. Again something I hope to track in GCC 5.

    Binary size



    LTO with merged library save about 30MB from the installation directory. Not bad! FDO saves 14MB. Oddly enough these two do are not additive and LTO+FDO build is about the same size as LTO build. Perhaps there are better ways how to train LibreOffice than by the default unit testing.

    GCC clearly needs some tuning for code size - LLVM builds smaller binaries with default setting and bigger with -Os. Also observe that code has grown in size for GCC 4.9. This may be partly my fault - for 4.9 I did not re-tune inliner heuristics (because I left SUSE where I have the setup). It is a good idea to re-tune each release - GCC measure function sizes after early optimizations and as early optimizations get more aggressive each release, the functions get smaller (in GCC's perception) and thus better inline candidates. With each release it is usually possible to trim down the limits somewhat without losing performance on the benchmarks. Doing so is however tedious process, because large body of benchmarking needs to be done (SPEC is not enough) and usually takes me weeks to do. The difference is however big enough so there may be some other bug that went unnoticed into the release :(.


    This graph breaks down libmergedlo (the largest library) size. LTO leads to 17% code size reduction and 15% reduction of the resulting binary. This is definitely nice. The reduction is 7MB overall. This code size difference is a lot better than one I reported for Firefox (however smaller to one reported here). I suspect this is because release builds of firefox are done with identical code folding and section garbage collection in the linker than hits a lot of unreachable code removal. Perhaps something to try with LibreOffice, too.

    FDO works even better with 23% overall binary reduction and 36% of code size.

    Finally LTO+FDO has slightly bigger code size, while overall binary size shrinks by reductions in data segment, relocations and EH.

    LLVM behaves somewhat differently - by default it produces smaller code and LTO makes the code bigger. Also LLVM -Os needs tuning - the binary size is bigger than GCC FDO optimized and definitely slower.

    Memory footprint after startup




    For this I simply opened LibreOffice writer and measured memory in TOP. The memory usage corresponds roughly to the libmergedlo size with exception of -Os builds that consume more memory than FDO builds. This is expected because FDO enables optimized code layout requiring smaller potion of library to be read. Also -Os is not good for code locality by skipping a lot of inlinng and alignment.

    GCC 4.9 mergedlibs saves 5MB, LTO additional 2MB and FDO+LTO additional 6MB. Overall 13MB (18%). Not bad. FDO alone gets close to FDO+LTO build. I also plan to work on static code layout improvements for GCC hopefully getting LTO closer to FDO+LTO scores.

    LLVM with mergedlibs is 3MB smaller than GCC 4.9 with mergedlib and 3MB bigger than GCC 4.9 with -Os and mergedlib (again the code size is probably to blame). LLVM with LTO and mergedlibs beats all GCC builds except one with FDO.

    Runtime

    I am not expert on LibreOffice benchmarking and I plan to update this section with more reliable data. I did very simple test of converting Open document specification into PDF:
    time program/soffice --headless --convert-to pdf --outdir /tmp ~/OpenDocument-v1.2-os.odt

    Judging from the stdout output, this test consists of about 50% of startup time and 50% of conversion time. I measured both hot runs (after first run) and cold runs (with disk cache flushed)


    (Note that the graph does not start with 0)

    Overall I found LibreOffice less sensitive to compiler optimization than Firefox, probably because I found no way to measure any interesting internal loops.

    The mergedlibs build shows small improvement from GCC 4.7 to 4.8. LTO seems to make no difference and FDO leads to just small improvement. LTO+FDO together seems to do a job improving the performance by 5%. The graphs also shows that -Os is not a good way for overall performance.

    LLVM seems to be slightly slower than GCC 4.7 while LTO build is same as GCC 4.8+ non-LTO.


    I also measured cold startup times. This graph has enough noise to be only able to tell that -Os is not making the startup faster.

    Beyond GCC 4.9

    As usual, when looking on a given problem for first time, there is a lot of low hanging fruit. Here are few things implemented for GCC 5.

    Removing write only global/static variables

    I never consider this important - having write-only variables in your program is stupid and no developer would do that? Well, it is not the case of large C++ projects. Both Firefox and LibreOffice seems to have quite few write only variables (in case of Firefox they were accessible only with debugging mode) that occupy BSS segment.  Implementing this simple transformation saved about 12% of BSS space.

    Reducing alignment of virtual tables

    x86-64 ABI requires to increase alignment of all arrays greater than 16 bytes to 16 byte boundary (as opposed to the natural alignment of the data). This decision was made in early 2000s in the anticipation that autovectorization will become more important in future. As an extension, GCC now also aligns every array greater than 32bytes to 32byte boundary and handles structures same way as arrays.

    This turns out to be extremely wasteful for C++ virtual tables (that are arrays of pointers but their accesses will never be vectorized, rather they are kind of random access structures) and for RTTI.

    LLVM aligns to 16 bytes in both cases that seems to cause over 10% difference in data segment size. I updated GCC 5 to align virtual tables to pointer size only.

    Localizing symbols into comdat groups

    C++ inline functions are often duplicated into every translation unit that uses them. These are later merged by linker to avoid duplicates in binaries.  What is however not handled by linker is the case where inline function access other function that is used purely by it.  Such functions may end up duplicated. For this reason I added a simple pass that makes such functions shared, too. This saves additional 0.5%

    Reducing default data alignment

    Data alignment was bumped up to 16 byte in early 2000s to make it possible to safely use SSE accesses to them. This was done for all datastructures above 32 bytes. With arrival of larger vector types (AVX), this was bumped up and in turn caused noticeable bloat. As was benchmarked by Intel guys, this however do not cause any performance improvement - even a decade and half later accessing data by vector register is quite rare event not worth to be optimizing for by default. Moreover vectorizer now knows how to increase data alignment on demand. For this reason this alignment was trimmed down.

    Code and data placement optimizations

    By default GCC puts little effort to code and data placement. Situation is better with FDO, where code is split into hot and cold sections and moreover with FDO+LTO the code gets ordered by approximate execution order. I believe that LTO builds ought to start considerably faster and consume less memory than non-LTO builds (but does not at the moment) and that we should look into an code placement optimizations.

    Identical code folding

    Use of templates in C++ tends to produce a lot of identical functions. Martin Liška works on identical code folding pass in GCC that should be more effetive than linker implementation. Hope it will land into mainline soon.

    Summary

    Both LTO and FDO shows a good improvements on LibreOffice: 36% reduction of code size, 23% reduction in the overall binary, 18% in memory use after startup and 5% in rather lame runtime benchmark. Build times and memory requirements seems to be in reasonable bounds, but some unit testing issues needs to be analyzed before this can be declared production ready. Also FDO needs ugly hack because of static TLS issues that definitely needs to be fixed. It may be interesting to invest some effort into finding decent train run for LibreOffice even though the current unit testing seems to do quite well.

    I hope situation will get better with GCC 5 and I also hope LibreOffice build system will be updated to make FDO builds easy. LibreOffice has definitely turned out to be interesting target to optimize for and I will try to keep eye one it during the GCC 5 development cycle.

    GCC LTO alone is a good tool to reduce code size, but runtime performance and memory use does not change much. At least I saw no off noise changes in very limited benchmarking I did. With LLVM the situation is oposite: LTO leads to small code size increase but also to measurable performance improvements (so the performance gets similar to GCC without LTO).

    More direct comparison of GCC and LLVM produced code would be interesting - LLVM seems to do better on code size at -O2 (by about 3%) and subsequently needs less memory for application to load. Even if the resulting code runs a bit slower, there may be more stupid and easy to fix issues that makes GCC code bigger. Quick look at the produced binaries identified few issues I fixed for GCC 5. We also may want to revisit our tuning tradeoffs because most of them was set over a decade ago and tested on much smaller benchmarks. While GCC 5 produced binary is now smaller than GCC 4.8 one (resolving the regression) it is still bigger than LLVM's.

    LLVM also does traditionally better at compile time (by about 35%) and memory usage. Sadly there is very noticeable regression in GCC 4.8 (increasing compile time by 11%). I am in progress of analyzing and fixing the compile time issues for GCC 5. On the other hand, GCC gets faster LTO linktimes when parallelism is enabled (or with debug symbols that I did not analyze in detail in this post) making it more useful for day to day development.

    I will try to update on GCC 5 performance once I get all trees updated and working again.

    15 comments:

    1. Cannot see the google docs spreadsheets. Opening in new tab shows a permission dialog asking for access

      ReplyDelete
    2. This comment has been removed by the author.

      ReplyDelete
      Replies
      1. Oops, sorry for that. I fixed the permissions!

        Delete
    3. still half of the graphs aren't visible. for example the ones under compile time, binary size, etc. maybe author should look at his post on another system (vm would suffice) which doesnt have google login and images cached.

      ReplyDelete
      Replies
      1. Sorry to hear that. I already got some feedback about the graphs, so they seem to be visible to some. Perhaps just reload will help?

        Perhaps i should switch from those live graphs to embedded images - only feature I like about them is that you can hover over the bars and get full numbers. But I am not that happy about how they are rendered and they are slow...

        Delete
    4. have you been able to link firefox with gold ?
      i'm always running into this bug:
      https://sourceware.org/bugzilla/show_bug.cgi?id=16728

      ReplyDelete
      Replies
      1. Yep, i am building firefox with gold and it works for me since 2010. Maybe a slightly different setup.

        Delete
    5. I had a question about how LTO will optimize a specific use-case.
      On reddit because blogspot comment interface sucks

      ReplyDelete
    6. ...for IRA (new register allocator introduced to 4.8) ...
      I guess you want to say LRA not IRA, it's a replacement for reload, just half part of register allocator. IRA was introduced in 4.4 :)

      ReplyDelete
      Replies
      1. Indeed, I meant LRA. What changed is the way register move cost and classes are computed. This is essentially O(N^4) algorithm in number of register classes. Sadly x86-64 has enough of register classes to make this matter.
        I made patch to old regclass that avoided computation of the full matrix but that got lost because LRA code branched earlier. Need find time to update it for the new code.

        Delete
    7. Hi Jan, the ranlib wrapper above misses the assignment to $command like the one in ar. By the way, do you know what's the status of LTO on ARM using thumb-2 instructions? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45729 Because thumb-2 gives strong savings on code size, I wonder how smaller code could get with (LTO -Os) or (LTO -O2) applied.

      ReplyDelete
    8. Hi Jan, the ranlib wrapper script is missing the assignment of $command and the shift to maintain the commands as the first parameter. By the way, do you know if LTO is playing well on ARM with thumb instructions enabled? As I experiment with it, using LTO or thumb gives similar savings on the major WebKit library, so I wonder what would be the resulting size by applying both simultaneously (but assembler is failing on 4.9.2 if combining thumb + LTO). https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45729 . Also, even though I see a 30% library size decrease, runtime memory consumption is decreased by only 2%. To get good runtime memory savings I believe LTO must be used with FDO.

      ReplyDelete
      Replies
      1. Thanks, I will update it. Hopefully these ugly scripts will not be needed anymore.

        As for LTO and non-x86 architectures in general, the only bigger limitations I am aware of is the lack of any logic to distribute section anchors across the partitions. This seems to manifests as bit bigger gap in between -flto and -flto -flto-partition=none seen on ARM compared to x86. Except for need to pickle LTO bitcode in object file, LTO is quite target independent feature.

        LTO on ARM should work quite well (i.e. GCC bootstraps/regtests and build smaller apps). I still did not set up my chromebook to be able to build firefox or libreoffice, but plan to try so in GCC 5.0 timeframe. Main limitation here is the 2GB memory footprint limit.

        I am not aware of any issues with thumb, if you hit any, just please open new PR. Not sure why PR45729 is resolved as invalid?

        Delete
    9. JFTR the ranlib shell wrapper is incorrect. It misses the: "command=$1; shift" part.

      ReplyDelete
      Replies
      1. It works as posted, but indeed it should be cleaned up (will update the particle). The difference is that ar takes unconditionally the first parameter as command, while ranlib and nm doesn't and the plugin parameter can appear first for these.

        Too much cut&paste in that hack. Luckilly the gcc-nm/gcc-ar/gcc-ranlib wrappers shipped with GCC works well these days.

        Delete