Sunday, April 12, 2015

Link time and inter-procedural optimization improvements in GCC 5

GCC 5.1 release candidate 1 has just been branched. Let's take a look at what changed in the inter-procedural optimization (IPA) and link-time optimization (LTO) frameworks.


Link time optimization (LTO) in GCC 4.5 - GCC 4.9

Last year I wrote a quite detailed post on the history of LTO in GCC. Here is a quick summary of where we stand now.

Reliability and scalability

Link-time optimization was merged into GCC 4.5. At that time it was able to compile "simple" standard-conforming C/C++/Fortran programs (for example the SPEC2006 benchmarks), but failed on many real-world applications due to both correctness and scalability problems.

GCC 4.6 added support for parallel link-time optimization (WHOPR), designed to deal with large applications. It took a couple of months to get GCC 4.6 to build applications as large as Firefox (with several hacks and workarounds). Even with WHOPR working as designed, compilation remained slow for large projects, the main bottleneck being the time needed to stream in all types and declarations.

GCC 4.7 to 4.9 gradually improved, and today GCC is able to build a good portion of the GNU/Linux operating system with LTO enabled, including the Linux kernel, Firefox, Chromium and Libreoffice. GCC 4.7 was barely useful for building such large applications, while GCC 4.8/4.9 got build times and memory use under control. According to Andi Kleen, the Linux kernel build is now only about 2%-15% slower (the smaller number is for a usual configuration, the larger for allyesconfig). This is down from 108% with GCC 4.8.

Thanks to the WHOPR model, GCC at link time now typically requires less memory and wall time than LLVM, and even without the parallelism it typically consumes "just" about 2 or 3 times as much memory as the linker. There is still a lot to improve. For example, LTO is reported to not be scalable enough for Google's internal applications.

Google is now working towards similar goals on the LLVM side. ThinLTO is to be discussed at the next LLVM developer meeting and I am very curious about that presentation. Google drove the early stages of the WHOPR implementation (it has been almost completely rewritten since then) as well as the separate LIPO framework, and I am curious what's next.

Code quality

LTO remains more expensive than the per-file model, especially in the compile-edit cycle. Without significant benefits there is not much motivation for developers to adopt it, which leads to a chicken-and-egg problem, because LTO will not improve without being used and tested.

See, for example, Linus' no on the kernel LTO changes last year. Linus' opinion did not make me happy, but it was kind of justified. Andi Kleen's work on LTO for the kernel took 3 years and he greatly helped to get LTO from early prototypes into the shape it is in today. I am happy that most of the LTO changes have found their way into the kernel since then.

While the majority of the traditional inter-procedural optimizers in GCC were prepared to operate on the whole program, dropping a whole program into the optimizer is not necessarily a win. It is easy for heuristics to get out of hand and produce huge spaghetti that does not work any better and is a pain to debug.

The code quality benefits enabled by LTO require careful retuning of the whole optimization queue. My posts on Firefox and Libreoffice show the code quality improvements by LTO and also the progress between GCC 4.8 and 4.9.

To briefly summarize my impressions: GCC's LTO is a quite good code size optimization: a 17% code size reduction on Libreoffice (and 36% if profile feedback is enabled) is not bad at all. The performance implications are harder to understand though. The gains depend on how well a given application was hand-tuned for the per-file compilation model (technically, given enough effort, most link-time optimizations can be done by developers by carefully factoring the code into translation units and using inline code in headers). A 3-5% improvement on a SPEC benchmark may seem small to a non-compiler developer, but it actually represents an important code quality change. Adopting LTO thus means a change in development practices and focus. Manual optimization effort is good for the small hot spots of a program, while LTO is very useful for improving quality over the whole binary. Hopefully, designing programs for LTO will, in the longer run, give developers more time to concentrate on the bigger picture rather than on micro-optimizations.

Changes in GCC 5

GCC 5 is definitely an exciting release from the IPA and LTO point of view. With more than 1400 patches to the link-time and inter-procedural optimization framework it fixes several long-standing problems, improves scalability and adds new features.

Let's take a look at the passes in the order they are executed. I also add some statistics about individual changes where I know them.

New Auto-FDO support

Feedback Directed Optimization (FDO) is a compilation model where the program is first compiled with profiling instrumentation and run on a sample of typical input (the train run). In the second step the binary is recompiled, enabling the compiler to use the profile gathered during the train run (to guide optimizations such as inlining, loop vectorization/unrolling and other changes). While FDO has been implemented in GCC for over a decade, it had a long way to go from a benchmarking toy to a practical tool. It found its use in places where performance matters (such as for Firefox, GCC itself, databases, and interpreters; Google is known to greatly depend on FDO internally, using a custom LIPO translation model and contributing many improvements to GCC's FDO infrastructure over the years).

FDO is great for code size and performance. It is however held back because it is difficult to use. One of the main issues is that profile collection needs to be part of the build process (any source code, configuration or compiler change invalidates the profile). This makes it hard to deploy FDO in distribution builds. For example, Firefox needs a running X server to train (and will silently produce a slow binary if you build it without X available). Profile collection is more difficult in a cross-compilation environment, and running the profiling runtime on embedded systems may be impossible. An instrumented application also runs significantly slower, which may affect its behaviour and lead to a misguided profile.

These problems are addressed by AutoFDO, which uses low-overhead profiling of an uninstrumented program to collect the profile. The resulting profile has a rather simple format tied to source code line numbers and is thus not as precise as in the case of classical instrumentation. It is faster to collect and more robust against source code and environment changes, which makes it possible to ship it with the application's source code and reuse it for future builds in individual distros.
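To make the difference concrete, here is a minimal sketch of the two workflows on a toy program. The file names are made up and the create_gcov invocation is only indicative; the exact perf events and converter options depend on the autofdo tooling installed.

// fdo-demo.cc - a toy hot loop used to illustrate the two workflows.
#include <cstdio>

int main()
{
  long sum = 0;
  for (long i = 0; i < 100000000; i++)   // the hot loop the profile will mark
    sum += (i % 7) ? i : -i;
  std::printf("%ld\n", sum);
}

// Classical FDO: instrument, train, rebuild using the gathered profile.
//   g++ -O2 -fprofile-generate fdo-demo.cc -o fdo-demo && ./fdo-demo
//   g++ -O2 -fprofile-use fdo-demo.cc -o fdo-demo
//
// AutoFDO: sample the ordinary optimized binary, convert, rebuild.
//   g++ -O2 -g fdo-demo.cc -o fdo-demo
//   perf record -b ./fdo-demo
//   create_gcov --binary=fdo-demo --profile=perf.data --gcov=fdo-demo.afdo
//   g++ -O2 -fauto-profile=fdo-demo.afdo fdo-demo.cc -o fdo-demo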

This infrastructure was developed by Dehao Chen at Google and existed in Google's GCC branch for a while (used, for example, for Chromium builds). Profiling is currently limited to Intel chips (because it uses a hardware profiling feature tracing branch history), but the resulting profile can be applied to non-x86 builds too. I am happy it finally found its way to mainline GCC this release cycle. (It was also ported to LLVM 3.5, but LLVM still has some way to go to get a full FDO infrastructure in place.)

The classical FDO instrumentation measures more than basic block and edge counts. It also looks for expected values passed to certain instructions (such as divisions), collects information about execution order, and more. I think there is a lot of room for future improvements, possibly by combining both infrastructures. Clearly the instrumented profile could also be easily lowered into the AutoFDO format and shipped with a binary.

Google reports a 4.7% performance improvement on SPEC2000 with AutoFDO, compared to a 7.3% improvement given by classical FDO; both measured on an Intel CPU with LTO enabled.

Classic FDO

The FDO implementation in GCC also continues to evolve. GCC 5 has several fixes and improvements to profile quality, for example when profiling extern inline functions. With LTO the profiles are now almost right for C++ inlines. Rong Xu (Google) also contributed a new gcov-tool utility that lets you play with the profiles. Several patches by Teresa Johnson (Google) and others also improve the code that updates the profile after transformations (especially after jump threading).

One measurable improvement is the increase of successfully speculated indirect calls in Firefox from 22138 to 32101, that is by 45%. I am a bit surprised by this. Partly it can be attributed to profile improvements and partly perhaps to the more aggressive early inlining discussed below.

New bounds checking infrastructure via Intel's MPX extension

The new bounds checking infrastructure enables GCC to use Intel's MPX extension. While there is no hardware supporting this extension available yet, it will hopefully soon serve as an interesting alternative to the address sanitizer implementation.

This pass was contributed by Ilya Enkovich (Intel) and implied some rather interesting changes across the inter-procedural optimization framework (which gave me a bit of a headache), adding support for functions that have two different calling conventions with one entry point. This will be useful elsewhere too; in particular it will let us finally solve a long-standing LTO bug about false/missed warnings with FORTIFY_SOURCE.
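For illustration, a tiny sketch of what the instrumentation checks. The -mmpx and -fcheck-pointer-bounds flags are the ones used by the GCC 5 era implementation, and the run-time trap of course needs MPX-capable hardware:

// mpx-demo.cc - built roughly as: g++ -O0 -mmpx -fcheck-pointer-bounds mpx-demo.cc
int main(int argc, char **)
{
  int buf[16] = {0};
  // Every pointer carries bounds under MPX instrumentation.  When argc == 1
  // the store below lands one element past the end of 'buf' and raises a
  // bound violation at run time instead of silently corrupting memory,
  // similar in spirit to what the address sanitizer reports in software.
  buf[15 + argc] = 1;
  return buf[0];
}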

Early optimization

Early optimization is a set of mostly intra-procedural optimizers that are still run at compile time even during LTO. These optimizations include:
  1. early inlining, 
  2. pass to determine memory location sizes, 
  3. constant propagation, 
  4. scalar replacement of aggregates, 
  5. alias analysis, 
  6. global value numbering driven full redundancy elimination,
  7. dead code elimination with control dependency, 
  8. inter-procedural scalar replacement of aggregates,
  9. tail recursion optimization (turning tail recursion to loop in some cases),
  10. profile estimation,
  11. discovery of nothrow, pure and const functions, and
  12. unreachable function and variable removal.
These are mostly intra-procedural optimizations and the changes in them are out of the scope of this post (it is way too long anyway). Their purpose is to perform the "obvious" transformations (in the sense that they improve both size and runtime) and hammer down the abstraction penalty, which increases the chance that the later heuristics will not get lost and will do the right thing.
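As a tiny illustration of the kind of abstraction penalty these passes remove (a made-up example, not taken from any of the projects above): after early inlining of the accessors and scalar replacement of the aggregate, the function below folds down to "return 12;" long before the heavyweight inter-procedural heuristics ever see it.

struct point
{
  int x, y;
  int get_x() const { return x; }   // trivial accessors, inlined early
  int get_y() const { return y; }
};

int area()
{
  point p = {3, 4};                 // the aggregate is scalarized away by SRA
  return p.get_x() * p.get_y();     // folds to the constant 12
}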

Noteworthy changes motivated by LTO include more aggressive inlining by the early inliner, stronger devirtualization, and better discovery of pure/const functions.

It is hard to quantify the effect of early optimizations in isolation. GCC's IPA profile pass computes the estimated size and runtime of the program based on feedback data early in the optimization queue. The statistics produced by GCC 5 while linking Firefox show 24062594864 time and 7985344 size. GCC 4.9 reports 27206581886 time and 8071488 size. So there seems to be about a 1% reduction in the number of instructions in the program representation at this point, but they are significantly faster (by about 15%). This seems expected - with more inlining one should see more, but faster, instructions. Improvements in the optimizations apparently completely masked the bloat caused by inlining.

OpenMP offloading

New infrastructure for offloading was added, contributed by Jakub Jelinek (Red Hat), Bernd Schmidt and Thomas Schwinge (CodeSourcery), Andrey Turetskiy, Ilya Verbin and Kirill Yukhin (Intel). This infrastructure makes it possible to stream out selected functions and load them into a separate compiler that produces code for an accelerator. More info is available on the GCC wiki page. The current implementation targets Xeon Phi, but the basic infrastructure is vendor independent.
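A minimal OpenMP 4.0 sketch of what gets streamed out: the target region below is handed to the offload compiler when one is configured, and simply falls back to host execution otherwise (built with something like g++ -O2 -fopenmp).

#include <cstdio>

int main()
{
  const int n = 1000;
  int a[n];

  // The body of the target region is the part compiled for the accelerator.
  #pragma omp target map(from: a[0:n])
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    a[i] = i * i;

  std::printf("%d\n", a[n - 1]);
}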

Whole program visibility (privatization)

This is the first pass run at link time. For the first time the compiler sees the whole program and also gets a resolution file (produced by the linker plugin) specifying the usage of symbols. For each symbol it is now known whether it is used by LTO code alone or whether it can be bound by the non-LTO portion of the program, either by static or dynamic linking. LTO-only symbols are turned into static symbols, enabling more optimization to happen. Unreachable function/variable removal is re-run and typically removes a lot of stuff that could not be removed at compile time.
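A small sketch of the privatization effect (the rest of the program is assumed, not shown): internal_helper below has external linkage, so a regular non-LTO build has to keep it exported in case some other object file calls it. At link time the resolution file tells GCC that no non-LTO code references it, so it can be made local, inlined into entry and then removed as unreachable.

// lib.cc, built and linked with -flto together with the rest of the program
int internal_helper(int x)
{
  return x + 1;
}

int entry(int x)
{
  return internal_helper(x) * 2;   // after privatization: just x * 2 + 2
}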

Whole program visibility improved in GCC 5 in several subtle ways, generally too technical to discuss here. Comdat groups are now dismantled, enabling unreachable code removal inside of them; references to global symbols are turned into local aliases where possible, speeding up dynamic linking, etc. One not so positive consequence is this bug, which shows that an attempt to mix incremental linking with LTO without binutils support fails more miserably now than it did with earlier releases.

The overall size of dynamic relocations was reduced by about 2% in the Firefox binary, and the improved visibility has other nice consequences too. Other applications may improve more; Firefox already links everything into one large DSO and most of its relocations are IP-relative.

Write only variables removal

GCC was always able to eliminate write-only local variables. Elimination of unused static variables was added about a decade ago. Elimination of write-only global variables however did not really fit the WHOPR structure and I never considered it important. I was proved wrong while looking into the data segment size of Firefox's javascript interpreter. A good part of it is a single write-only static variable that is used only if the DEBUG macro is defined. A similar story is also true of Libreoffice. I thus cooked up support for this.
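The pattern looks roughly like this (a made-up reconstruction, not the actual Firefox code): a static table that is only ever written unless a debugging macro is defined. GCC 5 can now drop both the table and the stores into it from release builds at link time.

void dump_stats(const long *);   // hypothetical debug-only consumer

static long stats[4096];         // write-only in release builds

void record_event(int kind)
{
  stats[kind & 4095]++;          // the only accesses are stores ...
#ifdef DEBUG
  dump_stats(stats);             // ... unless DEBUG is defined
#endif
}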

Identical code folding

Identical code folding is a new pass (contributed by Martin Liška, SUSE) looking for functions with the same code and variables with the same constructors. If some are found, one copy is removed and replaced by an alias to the other where possible. This is especially important for C++ code bases that tend to contain duplicated functions as a result of template instantiations.
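A sketch of the kind of duplication the pass catches: both instantiations below generate the same instructions (all pointers look alike to the back end), so one body can be kept and the other turned into an alias to it.

#include <vector>

template <class T>
T *find_first(const std::vector<T *> &v)
{
  for (T *p : v)
    if (p)
      return p;
  return nullptr;
}

// Two instantiations with identical code; ICF can fold one into the other.
template int   *find_first<int>(const std::vector<int *> &);
template float *find_first<float>(const std::vector<float *> &);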

In its nature this pass is similar to Gold's identical code folding feature. It however runs a lot earlier - just before inlining and cloning. This makes the code reuse explicit to the code-duplicating passes and has a chance to lead to better optimization decisions. GCC is also better than Gold at proving which transformations are safe and which are not. On the other hand, proving that two functions are identical in the compiler is much harder than comparing binary blobs with relocations. Not only do the instructions need to match, but all the additional meta-data maintained by the compiler needs to be matched and merged. This includes type-based alias analysis information, polymorphic call contexts, profile, loop dependencies and more. For this reason the pass does not replace Gold's feature. The initial implementation is rather simple (and even with that it was hard to get working for GCC 5) and will see incremental improvements in next releases.

ICF may also be annoying while debugging programs, because the backtraces point to seemingly nonsensical functions. There are DWARF extensions intended to handle this and some ideas are discussed in this PR. I however hope that developers will get used to the feature, like they did on some other platforms where it has been enabled by default for a long time. ICF can be controlled by -fipa-icf. More merging can be enabled by -fmerge-all-constants (provided that your program never compares addresses of two static variables); this is the equivalent of Gold's aggressive --icf=all, but only for variables. An option controlling function merging will be added for GCC 6.

ICF removes about 29000 functions out of the 282453 functions surviving the early unreachable code removal. This is about 10% of all functions. Sadly, removing 10% of functions does not translate to a 10% code size reduction. Usually only the small functions get removed, and Firefox also links itself with the identical code folding optimization in Gold. There still seem to be some issues to solve. While Gold's ICF reduces the Firefox binary by about 9%, GCC's ICF is less effective. GCC's ICF however seems to shine on Chromium, where it saves about 10% of code size on top of Gold's ICF.

Devirtualization

The devirtualization pass is a special-purpose pass intended to turn indirect calls to virtual functions into direct calls. It consists of five main components:
  1. the type inheritance graph builder,
  2. an analyser of polymorphic call contexts that is able to track the real type of an instance,
  3. a new propagation engine enabling easy forward propagation of the polymorphic call contexts across the program,
  4. a type inheritance graph walker that, given a context, constructs the list of possible targets of the polymorphic call, and
  5. the ipa-devirt pass that speculatively devirtualizes in case there is only one reasonable candidate on the list of possible targets.
The details can be found in my series on devirtualization. Being a new pass, it received a lot of attention. Here are the main changes:

1) Stronger polymorphic context analysis: Every polymorphic call has a type (that is, the pointed-to type of the class pointer used to call the virtual method) and a token (an index into the virtual table). The purpose of the polymorphic context analysis is to annotate the call with additional information about the type of the instance the polymorphic call refers to.

GCC 4.9 basically looked up whether a particular instance is within a static or automatic variable and, if so, performed some further analysis of whether the type is possibly in construction and assigned types accordingly. The common case of allocation via new was cowardly ignored.

GCC 5 can now determine the type of an instance by looking up the constructor call. For example, it tracks the type in the following innocent testcase that was out of reach of earlier analysers:
struct A
{
  virtual int virtual_method();
};

extern void external_call(struct A *);

int foo(void)
{
  struct A *a = new A;
  a->virtual_method();
     // This call will be devirtualized
     // (older GCC would devirtualize only if the ctor of A was inlined)

  external_call(a);
  return a->virtual_method();
     // This call will be devirtualized only speculatively:
     // full devirtualization is not possible
     // because external_call may use placement new on a
}
The effect of this change can easily be checked by looking into the -fdump-ipa-devirt logs of the compilation of a larger C++ program. The dump contains a summary.

Summary of building Firefox with GCC 4.9:

115554 polymorphic calls, 36028 speculatively devirtualized, 2980 cold, 75887 have multiple targets
... and the same summary with GCC 5:
112733 polymorphic calls, 42131 speculatively devirtualized, 2951 cold, 66637 have multiple targets

14% fewer calls remain without a speculation after the ipa-devirt pass.
The number of speculative devirtualizations also improved, by 16%. By removing the now useless polymorphic call targets, the number of functions drops from 221880 to 198467, by 11%.

Out of the 112k calls, 24k are calls on an instance passed as a function argument; those are further analyzed during constant propagation and inlining via the new propagation engine.

2) Type inheritance graph with non-polymorphic types: identifying types by the one definition rule in the link-time optimizer is a rather difficult job, because it is difficult to say what the name of a type is (especially in the context of templates, where template parameters are also part of the name). Jason suggested an alternative solution, where the C++ mangler is used to store mangled names of all non-anonymous types and the link-time optimizer simply unifies them by string equality. This works great, even though it increases the amount of streamed data by about 2-3%, and is now used by default by GCC (controlled by the -fodr-type-merging flag). Currently the infrastructure is used only for diagnostics, but I have a prototype implementing better type-based alias analysis on top of it.

3) Better diagnostics: I also put a lot of work into the new warning about One Definition Rule violations (-Wodr) that has already helped to chase some evil bugs out of Firefox and Libreoffice. Note that the quality of the warnings has significantly improved since my original blog post. This warning is more robust and informative than --detect-odr-violations implemented by Gold and safe enough to be enabled by default (you can use -Wno-odr to silence it). A similar option was also discussed for Clang, but as far as I know it is not implemented yet.
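A minimal sketch of the kind of bug -Wodr is after (two translation units shown in one listing; the build line is only indicative). With -flto the merged type information lets GCC compare the two definitions and warn about the mismatch:

// a.cc
struct S { int a; };
int get_a(S *s) { return s->a; }

// b.cc - the same name with a different layout: a One Definition Rule violation
struct S { long a; };
long get_b(S *s) { return s->a; }

// Built with something like:
//   g++ -O2 -flto -fPIC -c a.cc b.cc && g++ -O2 -flto -shared a.o b.o -o libs.so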

Another new warning helps users to annotate programs with the C++11 final keyword (controlled by -Wsuggest-final-types and -Wsuggest-final-methods).
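For example (a made-up hierarchy): when no class in the program derives from Impl, -Wsuggest-final-types proposes marking it final, and once the annotation is added, calls through pointers known to be Impl can be devirtualized without any whole-program reasoning.

struct Base
{
  virtual int get() const = 0;
  virtual ~Base() {}
};

struct Impl final : Base    // 'final' is what the warning suggests adding
{
  int get() const override { return 42; }
};

int use(const Impl *p)
{
  return p->get();          // direct call: nothing can override Impl::get
}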

4) Earlier unreachable code removal: devirtualization makes the most sense when the called method gets inlined. For this, the compiler needs to keep the bodies of all virtual functions that may potentially be reached by a virtual call, in the hope that some of the calls will be devirtualized to them. Because these virtual functions may refer to other code that could otherwise be optimized out, this may easily turn into an expensive feature.

GCC 4.8 and earlier simply kept every virtual method alive until after inlining to make devirtualization possible. GCC 4.9 introduced the type inheritance graph and the construction of "may" edges in the callgraph - that is, the construction of the full list of possible targets of every polymorphic call - and removed those virtual functions that can not be called. Surprisingly enough, this turned out to be one of the major improvements to LTO scalability for Firefox.

GCC 5 pushes this even further, both by a stronger analysis (and thus shorter lists of possible targets of a call) and by dropping, right after the inter-procedural devirtualization pass, all methods that are known to never be devirtualized to by the later propagation passes. This is based on the observation that the inliner and the inter-procedural constant propagation pass can devirtualize only those calls that can be tracked to instances passed as function parameters.

This further reduces the amount of code tracked by the later inter-procedural passes by about 10% on Firefox.

Constant propagation pass

The constant propagation pass, as the name suggests, propagates constant values passed in parameters across the program. When it proves that some parameter is always a known constant, it eliminates it.

The implementation in GCC 4.9 is a lot smarter than that. It is also capable of function cloning in case the known constant comes only from a subset of all callers of the function, it implements propagation of constants passed in aggregates (which is important for C++ classes and Fortran array descriptors), and it does simple inter-procedural devirtualization by forward-propagating the known type of a particular instance.
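A rough sketch of the cloning case (assume scale is too large to simply be inlined everywhere): since most callers pass the constant 2, the pass can create a specialized clone - GCC names such clones something like scale.constprop - with the multiplication folded, while the remaining caller keeps calling the generic version.

static int scale(int x, int factor)
{
  return x * factor;                   // imagine a much larger body here
}

int twice_a(int x) { return scale(x, 2); }      // these callers share the
int twice_b(int x) { return scale(x, 2); }      // specialized clone for factor == 2
int generic(int x, int f) { return scale(x, f); }  // this one keeps the original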

Martin Jambor (SUSE) reorganized the pass to be more modular and to allow propagation of various kinds of information across the program. Now the propagation is split into 3 propagators:
  1. propagation of constants and constants passed in aggregates,
  2. propagation of polymorphic call contexts for inter-procedural devirtualization (replacing the old instance type propagation that was never that useful), and
  3. propagation of known alignments of pointers to improve vectorization.
For the next release we hope to add propagation of value ranges and other tricks.

While compiling Firefox, the GCC 4.9 constant propagation pass modifies 18412 function bodies (4917 requiring cloning), while GCC 5.0 modifies 50761 (6267 requiring cloning). This is 175% more. The number of devirtualized calls increased from 428 to 1204, a 182% improvement. GCC 5.0 also does an extra 941 devirtualizations as an effect of constant propagation.

The number of function parameters turned into constants and removed however dropped from 18412 to 16731; something I do not quite understand and need to look into.

Inliner

The inliner is by far the most important inter-procedural optimization pass of every modern compiler. This is because most optimization still happens intra-procedurally, and inlining is essential to make it trigger. Modern programs use a lot of function-call-based, wannabe zero-cost abstractions that constantly bring new challenges to inliner heuristic implementations.

The main problem of the inliner is that while its decisions have a huge impact, it is not really well informed about what it is doing. It is hard to predict the optimizations that arise from particular inlining decisions. Most of the time the inliner thinks that it does a lot of code duplication just to eliminate the function call overhead, which does not match reality. Most inliners, including GCC's, are able to predict simple optimizations (such as unreachable code removal caused by constant propagation across function arguments), but they can't predict more complex patterns where, for example, all methods called on a particular instance must be inlined to allow the instance to be completely optimized out.

Since 2003 the main inliner in GCC has been organized as a simple greedy algorithm trying to do as many useful inlines as possible within the bounds given by several --param values. Those limit the size of inlined functions, the overall unit growth and the growth of large functions. (This differs from LLVM's inliner, which works exclusively in top-down order interleaved with other local optimizations, similarly to GCC's early inliner. A greedy inliner for LLVM is being discussed here, but it is more similar to the inliner in Open64.)

In the GCC 5 timeframe I put a lot of work into tuning the inliner for LTO builds (without ruining performance of non-LTO builds), which proved to be quite a hard job. I will report on this separately - while there are still several issues (and probably always will be), the size of LTO-optimized programs reduced nicely.

Nice work was also done by Trevor Saunders (Mozilla) and Martin Liška (SUSE) on turning the inline metric from the original fixed-point arithmetic to real numbers. This makes it a lot easier to work on it and to chase away strange roundoff/capping artefacts. This is one of the nice achievements of the C++ conversion - while I had considered it for a while, it seemed ugly to implement. GCC can not use the native float and double types, because these are not consistent across different hosts. Using hand-written reals in C is ugly, and moreover the Fibonacci heap was written to work only on integers. Implementing a C++ class for simple reals and turning the heap into a template did the trick. In the next release cycle I plan to follow up on this and turn more of the calculations into reals.

The code size of the Firefox binary built with -O3 -flto was reduced from 5MB (GCC 4.9) to 4.6MB (GCC 5), by 10%. With FDO the reduction is from 4MB to 3.5MB (14%). Firefox built with -O3 -flto by LLVM has 5.3MB of code segment, and with FDO 5.4MB (LLVM's inliner does not really know about profile feedback yet). I am still chasing some benchmark regressions compared to 4.9, and thus these numbers may or may not change in the final GCC 5 release.

IPA-reference

IPA reference is a simple pass that collects, for each function, the list of static variables that are known not to be accessed or modified by it. This information is then passed down to the optimizers, which can then better eliminate loads and stores.

The problem of ipa-reference is that it uses quite simple-minded sparse bitmaps to represent the information. On large programs with many variables and many functions this easily ends up being quadratic. A simple change to prune out, early on, the variables whose address is taken and which thus can not be analysed in this simple way saved a lot of memory and compile time on Chromium and the Linux kernel.
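A small sketch of the information the pass records (assume bump_other is too large to be inlined): ipa-reference knows that the call cannot touch counter, so the load after the call can be folded to the value just stored.

static int counter;
static int other;

static void bump_other(void)
{
  other++;                  // touches 'other', provably never 'counter'
}

int user(void)
{
  counter = 5;
  bump_other();             // ipa-reference: does not read or modify 'counter'
  return counter;           // can be folded to 'return 5;'
}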

I really hope this pass will be replaced by a smarter, proper mod-ref pass in the foreseeable future.

Comdat localization pass

Inline functions in C++ headers end up being compiled in every translation unit that uses them. To avoid duplication in the final binary, they are stored in special sections (COMDAT groups) that are later merged by the linker. An artifact of this process appears when such a function calls another function outside of the comdat group and is its only user. Such called functions will not be merged. The new pass simply adds such functions into the respective comdat groups.

This reduces the non-LTO Firefox binary by 1% and Libreoffice by 0.5%. It also has some potential for future improvements by also tracking variables and constants used purely by the function. Those are very common, but I noticed the problem too late for GCC 5.
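The pattern looks roughly like this (a header included from many translation units; assume helper is too big to be inlined into api): helper sits outside api's COMDAT group, so even after the linker merges the copies of api, the per-unit copies of helper used to be left behind. Placing helper into the group lets them be merged away too.

// some_header.h
static int helper(int x)     // local to each translation unit;
{                            // its only caller is 'api' below
  return x * 3 + 1;
}

inline int api(int x)        // COMDAT: the linker keeps a single copy
{
  return helper(x);
}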

Other improvements

OK, I have completed describing noteworthy changes in the individual optimization passes. There are however a number of changes in the LTO infrastructure in general. A few of them are worth mentioning:

Streaming command line options

Moving compilation to link time opens a huge can of worms with respect to command-line options. These options control optimization, but also specify details of the semantics of the program (for example -ftrapv says that the program should trap on overflow).

So far GCC behaved in a way where some command-line options were applied at compile time, some at link time and some partly at both. This brings surprises and wrong-code issues. For example, some makefiles do not pass optimization options to the linker and thus the resulting binary is not optimized. Worse yet, some options are not optional but mandatory for code correctness. Compiling a program with -fno-strict-aliasing and linking without it will make the LTO optimizer still assume that the program is strict-aliasing safe and will lead to wrong code.

GCC 5 now addresses the problem by annotating every function with corresponding optimize and target attributes built from the compile-time options. Thus all optimization options are streamed from compile time and honored at link time. This is an important change in behaviour and will likely lead to some surprises (for example, cross-module inlining of a function built with -fno-strict-aliasing into a function built without it no longer happens), but it is also a huge step forward in the maturity and safety of the implementation.
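Conceptually the streamed options behave as if every function had been tagged with the corresponding optimize/target attributes by hand; a rough sketch of the idea (not the actual mechanism GCC uses internally):

// Built as part of a unit compiled with -fno-strict-aliasing; at link time
// the function keeps those semantics even if the rest of the program and
// the link step use the default -fstrict-aliasing.
__attribute__((optimize("no-strict-aliasing")))
int bits_of(float *f)
{
  return *(int *)f;   // type punning that -fno-strict-aliasing permits
}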

The good news is that even without LTO the quality of the implementation of the target and optimize attributes improved (because of all the extra testing they receive now) and you can almost expect them to work as expected when used ;)

A similar solution is apparently being discussed for LLVM, too.

Firefox is an example of a program that uses different command-line options across different units. Most of Firefox is built with -Os and the javascript engine with -O3. Because link-time compilation is invoked with -O3, the Firefox binary used to come out "mostly -O3 optimized".

Firefox built with GCC 4.9 has 7.7MB of code, while GCC 5 gets 5.9MB, 30% less. An LLVM 3.7 compiled binary has 6.6MB. Sadly this also makes some of the Firefox benchmarks regress, because they exercise a lot of size-optimized code. I guess it is something for Firefox developers to judge, possibly revisiting the placement of optimization flags. My understanding is that some of the performance in those benchmarks is intentionally sacrificed to reduce the binary size.

Building the whole of Firefox with -O3 leads to code segments of 8.2MB (GCC 4.9), 7.7MB (GCC 5) and 8.3MB (clang 3.7). GCC sees more similarity between the two builds than clang, probably because most of the inlining with clang happens at compile time, while GCC inlines at compile time only for size and the -O3 inlining happens at link time.

Unfortunately, until the GCC IR is extended to handle most of the relevant options, this also may lead to performance degradation, because certain option mismatches prevent inlining. Disabling this feature reduces the Firefox code segment size by about 5%, so the correctness does not come without a cost. Maybe Firefox developers could stop using -fno-strict-aliasing in random portions of the project.

-fno-semantic-interposition flag

Symbol interposition is a feature of the ELF format, where a symbol defined by one library may be interposed by a symbol defined in a different one. For example, memory debuggers often interpose malloc. While this feature is certainly useful, it is also expensive, because it represents a boundary for compiler optimization.

While comparing some benchmarks to clang, I noticed that clang actually ignores the ELF interposition rules. While that is a bug, I decided to add the -fno-semantic-interposition flag to GCC to get similar behaviour. If interposition is not desirable, ELF's official answer is to use hidden visibility and, if the symbol needs to be exported, to define an alias. This is not always a practical thing to do by hand.
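A small sketch of why interposition matters for optimization; built as part of a shared library with something like g++ -O2 -fPIC -shared -fno-semantic-interposition interpose-demo.cc -o libdemo.so:

// interpose-demo.cc
int value(void)
{
  return 42;
}

int twice(void)
{
  // With default ELF semantics another DSO (or LD_PRELOAD) may interpose
  // 'value', so GCC must emit two real calls here.  With
  // -fno-semantic-interposition (or hidden visibility) it may assume the
  // local definition is the one that runs and fold this to 'return 84;'.
  return value() + value();
}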

Reducing memory use and streaming overhead

The LTO streamer is responsible for storing the intermediate language into the LTO object file, and it is traditionally one of the major bottlenecks of link-time optimization. The streamer itself is implemented quite efficiently, but it is fed a lot of data. Therefore it takes a long time to read everything and it consumes a lot of memory to store it.

One of the major improvements of GCC 4.9 was the reorganization of type merging (implemented by Richard Biener, SUSE); it was the change that made GCC 4.9 scalable enough for production Firefox builds.

GCC 5 saw many incremental improvements, mostly targeted at cleaning up the data structures and chasing unnecessary data away. A large reduction of /tmp use and an improvement in the partitionability of the program was achieved by pushing devirtualization earlier in the optimization queue.

In order to make the WHOPR model scalable to a large number of CPUs (or distributed compile nodes) we need to work on this area further. Important changes in this direction are underway on the early debug info branch (actively developed by Aldy Hernandez, Red Hat), which will clearly separate the information used by code generation from the debug info and allow optimizing the GIMPLE representation of types.

GCC 4.9 streams in 22096600 trees, while GCC 5.0 streams 20330302; an 8% improvement.
GCC 4.9 needs 1614636k of memory to store the trees, while GCC 5.0 needs 1481199k, again an 8% improvement.

A lot bigger change is seen in streaming out the modules for local optimization. Those are the temporary .o files created in the /tmp directory during the final link. GCC 5 streams only about 50% of the types and declarations that GCC 4.9 did.

GCC now (barely) fits into a 4GB memory bound for Firefox and most kernel configurations. Chromium still shows important problems, requiring almost 10GB to build.

... And more

The IPA, symbol table and callgraph APIs got a facelift in GCC 5 by conversion to C++ classes (this hard work was mostly done by Martin Liška, SUSE; David Malcolm, Red Hat; Trevor Saunders, Mozilla; and a bit by myself). This makes the code more compact and also gave us a chance to clean up terminology and fix mistakes and oddities that crept in over the years.

As we became more aware of the problems of large applications, more work also went into making binary sizes smaller at -O2. As one interesting example, the data alignment strategies were revisited. The x86-64 backend was especially keen to align everything to the size of the largest vector type (which went up with AVX), causing a lot of bloat. This was fixed by H.J. Lu (Intel), Jakub Jelínek (Red Hat) and Uroš Bizjak. Sadly, some older GCC releases assume that all static variables have this alignment, so it was considered unsafe to always drop it; there is a new -malign-data flag to control it. Data alignment was also completely dropped for virtual tables, because these are not accessed in a sequential manner.

I hope the new GCC will do a great job for your projects, and please report bugs if it does not!

23 comments:

  1. Great work, Jan!
    And thanks for letting everyone know about the great work by this informative post!

  2. Thank you for your detailed writeup! I especially liked the benchmarking information which made the changes tangible.

    1. Thanks. I gathered the statistic during the stage4 of development to be sure that new changes works as expected. I will try to get into publishing some performance data, too.

  3. It sounds like y'all are taking seriously the implications upon dwarf debuginfo for these more complicated transforms. Please always keep in mind the needs of debuggers and users who love them.

    1. It is not quite trivial to model modern optimizations for the debugger, but yep, quite some effort goes into that.

      I am bit concerned about the identical code folding, but according to the feedback from gold developers, the user confusion is not too bad.

    2. (We in systemtap land are still trying to come to terms with the dwarf impact of older ipa-split.c partial-inlining stuff: trying to identify the abstract "function entry" point.)

    3. Indeed very important. Often debug builds are so slow you can barely use them.
And with optimized builds exactly those variables you'd need to inspect are <optimized out>.

    4. Yep, I can imagine partial inlining to bring some surprises, but conceptually it has chance to work same way as inline functions work? I probably should get more familiar with systemtap to work out how this work. You can just consider the .part functions to be abstract (like debug output machinery does internally in GCC). GCC 5 has some tweaks that make it to do more partial inlining, hope it won't disturb too much.

      ICF may be more interesting for systemtap and live kernel patching as patching one function may by accident patch other, previously equivalent, function. Perhaps -fno-icf is something kernel may want to use by default with occasional look on function merges with allyesconfig/LTO build that can be then applied to the source code by hand?

      GCC always introduce an explicit alias or wrapper (if address is taken) when merging functions, so it should be possible for tools to notice it and simply refuse patching function with aliases, but I do not think we currently provide way to easily check if there is any function that was turned to wrapper of a given function.

      At such a lowlevel scenarios, one idea I was thinking about is to rename the main symbol of the function after merging. I.e. if FOO is merged with BAR, I can turn FOO into a static function called FOO.merged_with.BAR and turn both FOO and BAR to be its aliases. This would hopefully make meaningful name appear in the backtraces and would prevent chance of confusion with tools not aware of ICF.

      Sadly I got this idea about a week ago, so too late for 5.1 - it is a bit risky in a sense that we may run into interesting issues with overly long function names (at Chromium there is one group of function with 17000 members that are all merged to one) and also the change to static function&alias is not 100% straighforward to do (i.e. it will be first time GCC is doing this - most of time it goes the other way keeping the global name and introducing static alias when it needs), so I do not think this idea is going to make it to GCC 5.1.

    5. ICF should not be too bad wrt to tracking variable values, because it merge only read-only stuff. Also there is -Og now that gives better performance than -O0 and debug experience should be good.

  4. Have you tried LTO on OpenJDK?

    1. Not really - I got as far as downloading the source code and unpacking it. Will try to get back to that project soonish.

  5. Thanks for the insights!

  6. Thank you, this was an informative writeup.

    I am curious about identical code folding, for C++ code MSVC will fold identical functions even if their address is taken while Gold in safe mode will not since code may rely on functions having unique addresses. Although it is not clear to me if it is standard conformant to fold even if the address is taken.

    What is gcc planning on doing in the case of identical functions that have their address taken? I am curious since this came up on Stackoverflow a few months ago: http://stackoverflow.com/q/26533740/1708801 and it still seems open what conforming behavior should be.

GCC checks if the address of the function is taken (or can be taken before the function is exported). In this case it further checks if the function is virtual (those can not have their address compared) and if not, instead of merging, it turns one into a wrapper of the other, that is, having an extra jmp. This keeps the addresses different.

      Obviously there is room for improvement: instead of jmp one can just place nop in front of the function and one can work harder to see if functions addresses can be compared. We plan to work on it next release.

      It is not conforming to turn two functions to have same address, so MSVC is quite aggressive here. Doing so, for example, breaks GCC itself because to my surprise address compare is done in the precompiled headers code. It works for many other projects, including Firefox.

      Also there is no option for aggressive function merging in GCC, we will add it next release.

  7. This comment has been removed by the author.

  8. Very informative article.
    I'll think of all the work that went into this project next time I hit 'make'.
    Thanks for all the good work!

  9. Thank you for great and very useful articles!

    I am struggling with making inlining and devirtualization work.

    For example in the following code:

    struct object
    {
    const struct vtable *vtable;
    int x;
    };

    struct vtable
    {
    void (*inc)(struct object *, int);
    };

    static void inc(struct object *o, int v)
    {
    o->x += v;
    }

    const struct vtable object_vtable = {.inc = inc};

    static struct object *object(int x)
    {
    struct object *o = malloc(sizeof *o);

    *o = (struct object) {.x = x, .vtable = &object_vtable};

    return o;
    }

    int main()
    {
    struct object *o = object(42);

    o->vtable->inc(o, 1);
    /* o->vtable->inc(o, 2); */
    printf("%d\n", o->x);
    free(o);
    }

    Compiling with -O3 and gcc 5.2 I get this =>
    [...]main ()
    {
    :
    # DEBUG x => 42
    # DEBUG o => NULL
    # DEBUG o => NULL
    # DEBUG o => NULL
    # DEBUG v => 1
    printf ("%d\n", 43);
    return 0;

    }
    [..]

    Which is perfect. However if I uncomment the /* o->vtable->inc(o, 2); */ line, i.e.
    [...]
    o->vtable->inc(o, 1);
    o->vtable->inc(o, 2);
    printf("%d\n", o->x);
    free(o);
    [...]

    I get =>
    main ()
    {
    struct object * o;
    int _7;

    :
    # DEBUG x => 42
    o_10 = malloc (16);
    # DEBUG o => o_10
    o_10->vtable = &object_vtable;
    # DEBUG o => NULL
    # DEBUG o => o_10
    # DEBUG v => 1
    o_10->x = 43;
    inc (o_10, 2);
    _7 = o_10->x;
    printf ("%d\n", _7);
    free (o_10);
    return 0;

    }

    The first call is inlined and devirtualized, but the second call is not inlined. Am I missing something here?

    Kind regards,
    Fredrik

This is not really a devirtualization in the sense of this post (because you build the vtable by hand instead of using the C++ virtual keyword). The problem is that the compiler does not know whether the first invocation of inc is going to change the vtable pointer or not. One of the reasons for the specialized devirt machinery is to avoid issues like this.

    2. Does this mean that I currently simply can't do this using C (in a controlled manner)?

  10. The -fdump-ipa-devirt flag seems to have disappeared from gcc 5.2. Is there some new way to get the devirtualization statistics like you show above? Thanks!

    1. -fdump-ipa-devirt is still there:
      (talos)$ /aux/hubicka/5-install/bin/g++ --version
      g++ (GCC) 5.2.1 20151127
      Copyright (C) 2015 Free Software Foundation, Inc.
      This is free software; see the source for copying conditions. There is NO
      warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

      (talos)$ /aux/hubicka/5-install/bin/g++ t.C -O3 -fdump-ipa-devirt -S
      (talos)$ ls -l t.C.058i.devirt
      -rw-r--r-- 1 hubicka _cvsadmin 2558 Jan 11 20:22 t.C.058i.devirt

  11. It may make sense to mention that -fno-semantic-interposition is a better version of -Bsymbolic (more compiler-friendly and more ODR-respecting).
