Faster C++ program startups
by improving runtime linking efficiency.

Léon Bottou
AT&T Labs - Research, Middletown, NJ, USA
Current address: NEC Research Institute, Princeton, NJ, USA
http://leon.bottou.com

John Ryland
Trolltech AS, Oslo, Norway
mailto:jryland@trolltech.com

1- Introduction

This paper describes runtime linking speed issues arising with shared libraries written in C++. Two methods for reducing the runtime linking complexity, namely objprelink1 and combreloc, are compared and shown to be equivalent. A third method, namely objprelink2, yields additional speedups by resolving virtual table entries on demand. These methods are then compared with the full pre-linking approach. A promising hybrid is suggested.

2- Shared libraries

Shared libraries are a common feature of all modern operating systems. Executable program no longer incorporate the code of common library functions such as printf. Instead the executable program contains unresolved references to the library functions and an additional piece of information describing a shared library file(s) containing the missing functions. Shared libraries themselves can depend on additional shared libraries.

When the program starts up, the runtime linker loads the necessary shared libraries, resolves the undefined symbols, and eventually transfers control to the program itself.

The runtime linker performs the following operations:

The required patches are entirely described by relocation records located in the shared library file itself. Patches must be applied independently in each process using a given shared library. Virtual memory page affected by the patches no longer can be shared between processes using the same library. Resolving symbols also requires a potentially expensive search in the list of all symbols defined by all shared libraries.

The Linux system, and more generally all operating systems using the ELF object file format, use several strategies to reduce the cost of these operations [ELF-1995]:

Although these strategies are fairly successful on usual C programs, they do not address significant cases. A C shared library, for instance, may define the following global variables:

      char *a = "abcdef";
      double (*b)(double) = &cos;

The semantics of the C language implies that both variables a and b must contain absolute pointers to either the string "abcdef" or the library function cos(). Both variables will require a patch when the shared library will be used.

Remark: It is interesting to notice that the example above would almost work if variable b contained the address of a PLT entry for function cos() instead of the absolute address of function cos(). The shared library that initializes variable b can indeed call the cosine function via the function pointer. Calling it from another shared library does not work: the GOT register contains a pointer to the other shared library, while the PLT stub expects that it contains a pointer to the shared library that initializes b.

3- Runtime linking and C++.

It is unfortunate that the C++ language relies a lot more on constructs that are poorly handled by the runtime linking speedup strategies discussed above.

Waldo Bastian [Bastian-2001] has investigated the run-time linking performance of various components of the KDE desktop environment (a large collection of C++ shared libraries and executable). He pinpoints the performance issue to the resolution of the virtual table entries:

In his own words: " Is this a KDE specific problem? That depends on how you look at it. The problem of long startup times is caused by the linker needing to do many relocations. In the case of KDE, these relocations are for a great deal caused by vtable entries. KDE and Qt make liberal use of virtual functions and inheritance. In general every class that inherits from a base class with virtual functions will need to provide a vtable that covers at least the virtual functions of this base class. It seems that each vtable entry causes a relocation, and worse, a relocation that can not be done lazy. "

Waldo Bastian suggests two strategies to improve the runtime linking performance: (a) improving the efficiency of C++ virtual table linking, and (b) pre-relocation and pre-linking of shared libraries. This paper explores the first strategy in section 4 and 5. Jelinek's ELF pre-linker is briefly reviewed in section 6 in order to discuss the pros and cons of both approaches.

4- Reducing the complexity.

Each virtual table entry requires a separate relocation record. The runtime linker processes each relocation record by performing one symbol resolution and applying the corresponding patch.

In a large class hierarchy, the number of distinct symbols referenced by the virtual table entries is usually much lower than the number of virtual table entries themselves. All the derived class virtual tables to a large extent replicate the virtual tables of their base classes. For instance, almost every class in the KDE C++ code inherits about one hundred virtual functions from class QWidget. Every corresponding virtual table contains about one hundred identical entries. Each of these virtual table entries is separately resolved by the runtime linker.

4.1- objprelink-1

The objprelink-1 program [Bottou-2001] provides a first method for avoiding this waste of CPU cycles. Virtual table entries now point to a relay stub jumping to the target function. All virtual table entries addressing the same function use the same relay stub.

There is still a relocation record per virtual table entry, because each virtual table entry contains the absolute address of a relay stub. The runtime linker computes this address without performing a symbol resolution, merely by adding a displacement to the base address of the shared library. The relay stub, on the other hand, requires a relocation record involving a symbol resolution. But the number of relay stubs is much lower than the number of virtual table entries.

The program objprelink-1 (object file pre-linker) achieves this result by post-processing the object files produced by the compiler in a way that coerces the GNU link editor ld into producing the modified shared libraries. Measurements on various KDE2 applications [Bottou-2001] have shown a significant reduction of the runtime linking time. The perceptual responsiveness of the KDE2 desktop environment was significantly improved by using objprelink1.

4.2- combreloc

Recent versions of the Linux operating system provide a similar speedup using a scheme initially implemented in the Solaris 7 operating system. The method relies on modifications in both the link editor and the runtime linker [Jelinek-2001b].

The presence of both modifications ensures that relocation records involving the same symbol resolution require only one execution of the actual symbol resolution code.

4.3- Performance

The performance of both methods has been evaluated using the small benchmark program proposed in [Bastian-2001]. A collection of shared libraries of increasing complexity are compiled from the following source code:


Source code of test library testclassN.so
   #include <qwidget.h>
   template<int T> struct testclass : public QWidget {
     virtual void setSizeIncrement(int w, int h) 
       { QWidget::setSizeIncrement(w+T, h+T); }
   };
   template class testclass<1>;
   ...
   template class testclass<N>;

Each shared library testclassN.so contains N class definition. Each class inherits about one hundred virtual member functions from the QWidget class defined by Trolltech Qt-3.0.3 library.

The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared libraries. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.

Four cases are considered according to the presence of option -z combreloc and to the use of the object pre-linker objprelink1. All tests use the same source code, the same compiler, and the same version of the Qt library. Execution time differences are therefore caused by differences in the runtime linking time of the test shared libraries.


Compared execution times (milliseconds).
testclass*.so nocombreloc combreloc objprelink1+
nocombreloc
objprelink1+
combreloc
testclass1.so 46 47 45 45
testclass2.so 47 47 46 47
testclass5.so 48 47 47 48
testclass10.so 50 48 48 48
testclass20.so 54 49 49 48
testclass50.so 66 52 52 52
testclass500.so 275 101 100 91

These results show that both methods provide very significant speedups Employing one method only yields speedups of similar magnitude. Simultaneously using both methods provides only a marginal improvement over each method employed alone.

4.4- Summary

5- On-demand resolution of virtual table entries.

Both the objprelink1 and combreloc methods improve the runtime linking time by reducing the number of actual symbol resolutions. Further improvements can be achieved by providing lazy symbol resolution for virtual table entries.

Lazy symbol resolution improves the responsiveness of interactive programs because the cost of symbol resolution is distributed over the duration of the program execution. It also improves the overall runtime linking time because few programs actually use all the functions provided by the needed shared libraries. On our test system (Red Hat Linux 7.2 with KDE3), the konqueror web browser and its needed shared libraries reference 40021 distinct symbols in their relocation records. A typical session however only uses 17919 distinct symbols. More interestingly, the konqueror window appears after using only 13214 distinct symbols.

5.1- objprelink2

It seems very easy to obtain lazy symbol resolution for virtual table entries. We must first create a PLT entry for each distinct symbol in the virtual tables, and have each virtual table entry point to the corresponding PLT entry stub. The first call to a virtual function will invoke the runtime linker. Subsequent calls will directly jump to the target function.

Yet a difficult problem remains. As discussed by a remark in section 2, the PLT stub code expects that the got pointer register contains a pointer to the GOT of the shared library containing the PLT entry. This is true when the virtual function is called from code contained in this same shared library. This is no longer true when the virtual function is called from the executable program or from another shared library. Such a call usually causes a program crash.

Program objprelink2 solves the problem by inserting fake relocation records that cause the runtime linker to patch the relevant PLT entries and make them work regardless of the contents of the got pointer register.

The standard i386 stub code for a shared library PLT is laid out according to the following template. The got pointer register is named %ebx.

   PLT0:  pushl 4(%ebx)   // Code for entering runtime linker
          jmp   *8(%ebx)  // Code for entering runtime linker
          ....
                          // Beginning of PLT stub i.
   PLTi:  jmp   *offset_of_got_entry(%ebx)  
                          // This GOT entry initially points 
                          // to the following instruction.
          pushl $relocation_record_for_got_entry
          jmp   PLT0
                          // This jump enters the runtime linker code that
                          // updates the GOT entry and calls the target.
                          // Subsequent calls to this PLT entry will
                          // then jump directly to the target function.

Program objprelink2 redefines the relevant PLT entries according to the following model.

   PLT0:  pushl GOT+4
          jmp   *GOT+8
          ....
   PLTi:  jmp   *GOT+offset_of_got_entry
          pushl $relocation_record_for_got_entry
          jmp   PLT0

References to the got pointer register have been replaced by the absolute address of the relevant GOT entries. Such a construct is normally used in executable programs. Since shared libraries can be loaded at variable addresses, objprelink2 creates additional relocation records that will adjust these absolute addresses according to the actual loading address of the shared library.

The program objprelink2 performs these modifications by preprocessing the object files and post processing the shared library. Additional details can be found here. It would be considerably more elegant and more efficient to perform these operations inside the GNU link editor ld. Yet objprelink2 is sufficient for verifying the concept and measuring the resulting performance.

5.2- Relocation counts by type

The performance of the objprelink2 method can be measured using the test shared libraries testclassN.so presented in section 4.3. A first measurement counts the number of relocation records of each type in the test shared libraries testclassN.so. Four record types are relevant for Intel processors:

R_386_RELATIVE These records describe patches than only depend on the loading address of the shared library and therefore require no symbol resolution.
R_386_32 These records describe patches that require a symbol resolution when the program starts up. Virtual table entries are usually constructed using these relocation records.
R_386_JUMP_SLOT These records describe GOT patches that involve a lazy symbol resolution. These relocation records are not processed when the program starts up, but when the corresponding function is first invoked.
R_386_GLOB_DATA These records describe GOT patches that require a symbol resolution when the program starts up. These records are listed for reference and remained unchanged.

The following tables present the relocation record counts for the test shared libraries compiled as usual, using objprelink1, or using objprelink2.


Relocation record counts.
Normal compilation.
R_386_32 R_386_GLOB_DAT R_386_JUMP_SLOT R_386_RELATIVE
testclass1.so 121 9 8 3
testclass2.so 242 13 8 3
testclass5.so 605 25 8 3
testclass10.so 1210 45 8 3
testclass20.so 2420 85 8 3
testclass50.so 6050 205 8 3
testclass500.so 60500 2005 8 3

Relocation record counts.
Compiled with objprelink1.
R_386_32 R_386_GLOB_DAT R_386_JUMP_SLOT R_386_RELATIVE
testclass1.so 120 9 8 123
testclass2.so 125 13 8 243
testclass5.so 140 25 8 603
testclass10.so 165 45 8 1203
testclass20.so 215 85 8 2403
testclass50.so 365 205 8 6003
testclass500.so 2615 2005 8 60003

Relocation record counts.
Compiled with objprelink2.
R_386_32 R_386_GLOB_DAT R_386_JUMP_SLOT R_386_RELATIVE
testclass1.so 1 9 127 244
testclass2.so 2 13 131 368
testclass5.so 5 25 143 740
testclass10.so 10 45 163 1360
testclass20.so 20 85 203 2600
testclass50.so 50 205 323 6320
testclass500.so 500 2005 2123 62120

These counts illustrate how both methods work:

5.3 - Execution times

The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared libraries testclassN.so. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.

Four cases are considered, depending on whether the Qt library and/or the test shared library were compiled using the default settings or using objprelink2. The combreloc option and the symbol resolution cache (see section 4.2) were enabled in all experiments.


Compared execution times (milliseconds).
libqt-mt.so
testclass*.so
default
default
default
objprelink2
objprelink2
default
objprelink2
objprelink2
testclass1.so 93 92 47 46
testclass2.so 93 94 47 46
testclass5.so 94 92 47 46
testclass10.so 93 92 48 46
testclass20.so 95 93 49 47
testclass50.so 100 97 52 49
testclass500.so 144 125 101 75

Lazy linking virtual table entries brings another 50% speedup over previous improvements. Compiling the test shared libraries brings a speedup that is roughly proportional to the number of classes in the shared library. Switching from a normal version to a version of Qt compiled with objprelink2 saves about 40 milliseconds.

5.4- Qt performance

The following table counts the number of relocation records of each type in the Qt-3.0.3 library. The number of upfront symbol resolutions has been reduced from 26823 to 3640.


Relocation record counts for libqt-mt.so.3.0.3.
R_386_32 R_386_GLOB_DAT R_386_JUMP_SLOT R_386_RELATIVE
default 24571 2252 6916 11356
objprelink2 1388 2252 12339 40750

The following table indicates the sizes of various sections of the Qt-3.0.3 shared library file.


Section sizes for libqt-mt.so.3.0.3 (stripped).
Total .plt .text .{*}data .got .rel.{*}
default 6660K 110K 4147K 1010K 37K 361K
objprelink2 6939K 198K 4209K 1055K 58K 454K

Several size increases can be observed:

An optimal implementation of the objprelink2 ideas would not change the size of either the TEXT or RODATA section. Details can be found here.

5.5- KDE performance

Our final test consists in compiling all the KDE3 desktop environment using objprelink2. Libraries compiled with objprelink2 seem to perform as flawlessly as the originals on our test machine, a 600 MHz Pentium III laptop running Red Hat Linux 7.2.

The following printout shows typical statistics returned by the runtime linker when running the web browser konqueror, before and after recompiling Qt and KDE3 with objprelink2.


Using combreloc only:
[leonb@fuji src]$ LD_DEBUG=statistics konqueror
32013:  runtime linker statistics:
32013:    total startup time in dynamic loader: 159462206 clock cycles
32013:              time needed for relocation: 156734640 clock cycles (98.2%)
32013:                   number of relocations: 22312
32013:        number of relocations from cache: 46825
32013:             time needed to load objects: 2506857 clock cycles (1.5%)


After recompiling Qt and KDE3 with objprelink2:
[leonb@fuji leonb]$ LD_DEBUG=statistics konqueror
28630:  runtime linker statistics:
28630:    total startup time in dynamic loader: 73010700 clock cycles
28630:              time needed for relocation: 70757732 clock cycles (96.9%)
28630:                   number of relocations: 8201
28630:        number of relocations from cache: 4414
28630:             time needed to load objects: 2014055 clock cycles (2.7%)

Multiple runs can return very different results for the time needed to load objects. The results for the time needed for relocation are very consistent. The numbers of relocations, of course, never change across multiple runs.

Recompiling both Qt and KDE3 with objprelink2 reduces the time needed for relocation by one half compared to combreloc alone. This represents about 150 milliseconds on our test machine. This result however does not account for the time needed to perform the lazy symbol resolutions.

A more interesting measurement evaluates the delay between the moment konqueror is launched and the moment konqueror starts waiting for user input. This startup time can be evaluated by tracing well chosen system calls using the program strace. The following table reports the konqueror startup times with different setups for lazy symbol resolution:


Konqueror startup times (seconds)
No lazy resolution,
(using LD_BIND_NOW=1)
   3.07
Normal lazy resolution,
(using combreloc).
2.80
Normal lazy resolution,
Qt compiled with objprelink2.
2.78
Normal lazy resolution,
Qt and KDE3 compiled with objprelink2.
2.66

The konqueror startup time decreases by about 140 milliseconds when both Qt and KDE3 are recompiled using objprelink2. This is close to the results obtained from the runtime linker statistics. It also demonstrates that a large improvement in runtime linking speed no longer translates into obvious improvements in application startup times.

Only recompiling Qt does not bring significant speedups. This result probably means that a large proportion of Qt virtual table symbols need to be resolved during the application startup.

5.6- Summary

Follow this link to replicate.

6- Comparison with the ELF pre-linker.

6.1- Pre-linker overview

Another way to improve the runtime linking consists in pre-relocating and pre-linking shared libraries. This approach is very nicely illustrated by Jelinek's ELF pre-linker [Jelinek-2001].

Pre-linking involves an additional subtlety. The ELF document specifies that undefined symbols in shared libraries must be first searched in the main executable program, then searched in the needed shared libraries. It is therefore necessary to record potential conflicts in the executable file, that is to say, a list of symbols defined by the main executable that override symbols defined by shared libraries.

When a program starts, the runtime linker checks (a) that the executable program file contains an up-to-date conflict list, (b) that all the needed shared libraries have been pre-relocated and pre-linked, (c) that all the shared libraries can be loaded at their selected pre-relocation address, and (d) that no shared library has been modified since pre-linking object files that depend on them. When all conditions are verified, the runtime linker needs only apply the few patches described by the conflict list. This is obviously a very fast operation since it mostly consists in doing nothing. It also maximizes the number of virtual memory pages that can be shared between several instances of the same shared library running in different processes.

The price of this performance is a reduced flexibility. The runtime linker must revert to the usual processing as soon as one of the above condition is not met. Upgrading a minor shared library in the system is very likely to compromise the runtime linking speed of a large number of programs. The only way to restore the optimal speed then consists of running the pre-linker on all executables and all shared libraries in the system. This is a potentially expensive operation.

6.2- Suggestion

The results reported in this paper suggest a way to obtain nearly optimal runtime linking speed with a much smaller cost in flexibility.

7- Conclusion.

The runtime linking properties of C++ programs has been described. Complexity reduction methods such as objprelink1 and combreloc have been compared and shown to be equivalent. Further speedups using lazy resolution of virtual table symbols have been demonstrated using objprelink2. Comparing these methods with the full ELF pre-linker has lead to suggesting a promising hybrid approach.

References

[ELF-1995]
TIS Committee: Tool Interface Standard (TIS) - Executable and Linking Format (ELF) Specification - version 1.2, May 1995,
http://www.linuxbase.org/spec/refspecs/elf.pdf.

[Bastian-2001]
Waldo Bastian: Making C++ ready for the desktop, May 2001,
http://www.suse.de/~bastian/Export/linking.txt.

[Jelinek-2001]
Jakub Jelinek: ELF prelinking, June 2001,
http://sources.redhat.com/ml/libc-alpha/2001-06/msg00113.html. Additional info here.

[Bottou-2001]
Leon Bottou: Faster startups by fixing C++ object files, July 2001,
http://objprelink.sourceforge.net/index-1.html. Additional info here.

[Jelinek-2001-b]
Jakub Jelinek: Support -z combreloc in binutils, August-2001,
http://sources.redhat.com/ml/binutils/2001-08/msg00311.html.