Objprelink is a code transformation tool that optimizes C++ shared libraries in order to reduce the runtime linking time. This tool was very effective in 2001 for speeding up KDE 2. This is no longer the case because newer runtime linkers perform much better in that respect. The paper below contains explanations as well as links to the actual code.

Faster C++ program startups
by improving runtime linking efficiency.

Léon Bottou
AT&T Labs - Research, Middletown, NJ, USA
Current address: NEC Research Institute, Princeton, NJ, USA
http://leon.bottou.com

John Ryland
Trolltech AS, Oslo, Norway
mailto:jryland@trolltech.com

May 2002

1- Introduction

This paper describes runtime linking speed issues arising with shared libraries written in C++. Two methods for reducing the runtime linking complexity, namely objprelink1 and combreloc, are compared and shown to be equivalent. A third method, namely objprelink2, yields additional speedups by resolving virtual table entries on demand. These methods are then compared with the full pre-linking approach. A promising hybrid is suggested.

2- Shared libraries

Shared libraries are a common feature of all modern operating systems. Executable program no longer incorporate the code of common library functions such as printf. Instead the executable program contains unresolved references to the library functions and an additional piece of information describing a shared library file(s) containing the missing functions. Shared libraries themselves can depend on additional shared libraries.

When the program starts up, the runtime linker loads the necessary shared libraries, resolves the undefined symbols, and eventually transfers control to the program itself.

The runtime linker performs the following operations:

A suitable space in memory must be allocated for each necessary shared library.
The shared library code and data must appear in this space. This is usually achieved by mapping the shared library file using virtual memory. Multiple processes then can use the same read-only memory image of the shared library. This approach yields considerable memory savings
The shared library might contain absolute addresses that must be patched in order to reflect the actual loading address of the library.
Undefined symbols in the program and all shared libraries must be resolved using symbols defined in other shared libraries. Code and data referring to each undefined symbol must be patched in order to reflect the address of the corresponding definitions.

The required patches are entirely described by relocation records located in the shared library file itself. Patches must be applied independently in each process using a given shared library. Virtual memory pages affected by the patches can no longer be shared between processes using the same library. Resolving symbols also requires a potentially expensive search in the list of all symbols defined by all shared libraries.

The Linux system, and more generally all operating systems using the ELF object file format, use several strategies to reduce the cost of these operations [ELF-1995]:

Shared libraries are usually compiled as position independent code (PIC). For instance, the compiler will generate branch instructions that specify the branch target relative to the branch instruction. Position independent code reduces the number of necessary patches.
References to undefined symbols are grouped together in a data area known as the global offset table (GOT). Regrouping undefined references ensures that a minimal number of virtual memory pages are affected by the patches. A piece of code that would otherwise refer to an undefined symbol is changed into an instruction sequence that retrieves the desired address from the GOT. To make this easier, compilers compute the location of the GOT at the beginning of each function and keep it in a specific processor register named the got pointer register.
References to undefined functions can be resolved on demand. Jumps to undefined symbols are replaced by jumps to stubs placed in the program location table (PLT). Each stub jumps to the address specified by a corresponding entry in the GOT. This address initially refers to a piece of code that locates the symbol definition, patches the GOT entry, and finally executes the function itself. The first call to the target function performs the link edition. Subsequent calls directly call the target function. This lazy symbol resolution ensures that no time is wasted resolving function symbols that are not useful for a particular program. This is very significant because programs are usually linked with fairly large libraries (such as the C library) but only use a small fraction of the facilities offered by the library. Lazy symbol resolution also reduces the apparent startup time of interactive program because it avoids performing a complete link edition upfront.

Although these strategies are fairly successful on usual C programs, they do not address significant cases. A C shared library, for instance, may define the following global variables:

      char *a = "abcdef";
      double (*b)(double) = &cos;

The semantics of the C language implies that both variables a and b must contain absolute pointers to either the string "abcdef" or the library function cos(). Both variables will require a patch when the shared library will be used.

Remark: It is interesting to notice that the example above would almost work if variable b contained the address of a PLT entry for function cos() instead of the absolute address of function cos(). The shared library that initializes variable b can indeed call the cosine function via the function pointer. Calling it from another shared library does not work: the GOT register contains a pointer to the other shared library, while the PLT stub expects that it contains a pointer to the shared library that initializes b.

3- Runtime linking and C++.

It is unfortunate that the C++ language relies a lot more on constructs that are poorly handled by the runtime linking speedup strategies discussed above.

Waldo Bastian [Bastian-2001] has investigated the run-time linking performance of various components of the KDE desktop environment (a large collection of C++ shared libraries and executable). He pinpoints the performance issue to the resolution of the virtual table entries:

In his own words: " Is this a KDE specific problem? That depends on how you look at it. The problem of long startup times is caused by the linker needing to do many relocations. In the case of KDE, these relocations are for a great deal caused by vtable entries. KDE and Qt make liberal use of virtual functions and inheritance. In general every class that inherits from a base class with virtual functions will need to provide a vtable that covers at least the virtual functions of this base class. It seems that each vtable entry causes a relocation, and worse, a relocation that can not be done lazy. "

Waldo Bastian suggests two strategies to improve the runtime linking performance: (a) improving the efficiency of C++ virtual table linking, and (b) pre-relocation and pre-linking of shared libraries. This paper explores the first strategy in section 4 and 5. Jelinek's ELF pre-linker is briefly reviewed in section 6 in order to discuss the pros and cons of both approaches.

4- Reducing the complexity.

Each virtual table entry requires a separate relocation record. The runtime linker processes each relocation record by performing one symbol resolution and applying the corresponding patch.

In a large class hierarchy, the number of distinct symbols referenced by the virtual table entries is usually much lower than the number of virtual table entries themselves. All the derived class virtual tables to a large extent replicate the virtual tables of their base classes. For instance, almost every class in the KDE C++ code inherits about one hundred virtual functions from class QWidget. Every corresponding virtual table contains about one hundred identical entries. Each of these virtual table entries is separately resolved by the runtime linker.

4.1- objprelink-1

The objprelink-1 program [Bottou-2001] provides a first method for avoiding this waste of CPU cycles. Virtual table entries now point to a relay stub jumping to the target function. All virtual table entries addressing the same function use the same relay stub.

There is still a relocation record per virtual table entry, because each virtual table entry contains the absolute address of a relay stub. The runtime linker computes this address without performing a symbol resolution, merely by adding a displacement to the base address of the shared library. The relay stub, on the other hand, requires a relocation record involving a symbol resolution. But the number of relay stubs is much lower than the number of virtual table entries.

The program objprelink-1 (object file pre-linker) achieves this result by post-processing the object files produced by the compiler in a way that coerces the GNU link editor ld into producing the modified shared libraries. Measurements on various KDE2 applications [Bottou-2001] have shown a significant reduction of the runtime linking time. The perceptual responsiveness of the KDE2 desktop environment was significantly improved by using objprelink1.

4.2- combreloc

Recent versions of the Linux operating system provide a similar speedup using a scheme initially implemented in the Solaris 7 operating system. The method relies on modifications in both the link editor and the runtime linker [Jelinek-2001b].

The GNU link editor ld option -z combreloc, enabled by default, reorders the shared library relocation records according to the name of the symbol to resolve. All relocation records referencing the same symbol are placed consecutively in the relocation record table.
The runtime linker processes the relocation records in sequence and always caches the last symbol resolution.

The presence of both modifications ensures that relocation records involving the same symbol resolution require only one execution of the actual symbol resolution code.

4.3- Performance

The performance of both methods has been evaluated using the small benchmark program proposed in [Bastian-2001]. A collection of shared libraries of increasing complexity are compiled from the following source code:

Source code of test library testclassN.so

   #include <qwidget.h>
   template<int T> struct testclass : public QWidget {
     virtual void setSizeIncrement(int w, int h) 
       { QWidget::setSizeIncrement(w+T, h+T); }
   };
   template class testclass<1>;
   ...
   template class testclass<N>;

Each shared library testclassN.so contains N class definition. Each class inherits about one hundred virtual member functions from the QWidget class defined by Trolltech Qt-3.0.3 library.

The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared libraries. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.

Four cases are considered according to the presence of option -z combreloc and to the use of the object pre-linker objprelink1. All tests use the same source code, the same compiler, and the same version of the Qt library. Execution time differences are therefore caused by differences in the runtime linking time of the test shared libraries.

Compared execution times (milliseconds).

testclass*.so	nocombreloc	combreloc	objprelink1+ nocombreloc	objprelink1+ combreloc
testclass1.so	46	47	45	45
testclass2.so	47	47	46	47
testclass5.so	48	47	47	48
testclass10.so	50	48	48	48
testclass20.so	54	49	49	48
testclass50.so	66	52	52	52
testclass500.so	275	101	100	91

These results show that both methods provide very significant speedups Employing one method only yields speedups of similar magnitude. Simultaneously using both methods provides only a marginal improvement over each method employed alone.

4.4- Summary

Both the objprelink1 and combreloc methods yield the same speedups because they reduce the complexity of the runtime linking process in similar ways.
There is no point using objprelink1 on a system implementing the combreloc method. Follow this link to check whether your Linux system implements the combreloc method.

5- On-demand resolution of virtual table entries.

Both the objprelink1 and combreloc methods improve the runtime linking time by reducing the number of actual symbol resolutions. Further improvements can be achieved by providing lazy symbol resolution for virtual table entries.

Lazy symbol resolution improves the responsiveness of interactive programs because the cost of symbol resolution is distributed over the duration of the program execution. It also improves the overall runtime linking time because few programs actually use all the functions provided by the needed shared libraries. On our test system (Red Hat Linux 7.2 with KDE3), the konqueror web browser and its needed shared libraries reference 40021 distinct symbols in their relocation records. A typical session however only uses 17919 distinct symbols. More interestingly, the konqueror window appears after using only 13214 distinct symbols.

5.1- objprelink2

It seems very easy to obtain lazy symbol resolution for virtual table entries. We must first create a PLT entry for each distinct symbol in the virtual tables, and have each virtual table entry point to the corresponding PLT entry stub. The first call to a virtual function will invoke the runtime linker. Subsequent calls will directly jump to the target function.

Yet a difficult problem remains. As discussed by a remark in section 2, the PLT stub code expects that the got pointer register contains a pointer to the GOT of the shared library containing the PLT entry. This is true when the virtual function is called from code contained in this same shared library. This is no longer true when the virtual function is called from the executable program or from another shared library. Such a call usually causes a program crash.

Program objprelink2 solves the problem by inserting fake relocation records that cause the runtime linker to patch the relevant PLT entries and make them work regardless of the contents of the got pointer register.

The standard i386 stub code for a shared library PLT is laid out according to the following template. The got pointer register is named %ebx.

   PLT0:  pushl 4(%ebx)   // Code for entering runtime linker
          jmp   *8(%ebx)  // Code for entering runtime linker
          ....
                          // Beginning of PLT stub i.
   PLTi:  jmp   *offset_of_got_entry(%ebx)  
                          // This GOT entry initially points 
                          // to the following instruction.
          pushl $relocation_record_for_got_entry
          jmp   PLT0
                          // This jump enters the runtime linker code that
                          // updates the GOT entry and calls the target.
                          // Subsequent calls to this PLT entry will
                          // then jump directly to the target function.

Program objprelink2 redefines the relevant PLT entries according to the following model.

   PLT0:  pushl GOT+4
          jmp   *GOT+8
          ....
   PLTi:  jmp   *GOT+offset_of_got_entry
          pushl $relocation_record_for_got_entry
          jmp   PLT0

References to the got pointer register have been replaced by the absolute address of the relevant GOT entries. Such a construct is normally used in executable programs. Since shared libraries can be loaded at variable addresses, objprelink2 creates additional relocation records that will adjust these absolute addresses according to the actual loading address of the shared library.

The program objprelink2 performs these modifications by preprocessing the object files and post processing the shared library. Additional details can be found here. It would be considerably more elegant and more efficient to perform these operations inside the GNU link editor ld. Yet objprelink2 is sufficient for verifying the concept and measuring the resulting performance.

5.2- Relocation counts by type

The performance of the objprelink2 method can be measured using the test shared libraries testclassN.so presented in section 4.3. A first measurement counts the number of relocation records of each type in the test shared libraries testclassN.so. Four record types are relevant for Intel processors:

R_386_RELATIVE These records describe patches than only depend on the loading address of the shared library and therefore require no symbol resolution.

R_386_32 These records describe patches that require a symbol resolution when the program starts up. Virtual table entries are usually constructed using these relocation records.
R_386_JUMP_SLOT These records describe GOT patches that involve a lazy symbol resolution. These relocation records are not processed when the program starts up, but when the corresponding function is first invoked.
R_386_GLOB_DATA These records describe GOT patches that require a symbol resolution when the program starts up. These records are listed for reference and remained unchanged.

The following tables present the relocation record counts for the test shared libraries compiled as usual, using objprelink1, or using objprelink2.

Relocation record counts.
Normal compilation.

	R_386_32	R_386_GLOB_DAT	R_386_JUMP_SLOT	R_386_RELATIVE
testclass1.so	121	9	8	3
testclass2.so	242	13	8	3
testclass5.so	605	25	8	3
testclass10.so	1210	45	8	3
testclass20.so	2420	85	8	3
testclass50.so	6050	205	8	3
testclass500.so	60500	2005	8	3

Relocation record counts.
Compiled with objprelink1.

	R_386_32	R_386_GLOB_DAT	R_386_JUMP_SLOT	R_386_RELATIVE
testclass1.so	120	9	8	123
testclass2.so	125	13	8	243
testclass5.so	140	25	8	603
testclass10.so	165	45	8	1203
testclass20.so	215	85	8	2403
testclass50.so	365	205	8	6003
testclass500.so	2615	2005	8	60003

Relocation record counts.
Compiled with objprelink2.

	R_386_32	R_386_GLOB_DAT	R_386_JUMP_SLOT	R_386_RELATIVE
testclass1.so	1	9	127	244
testclass2.so	2	13	131	368
testclass5.so	5	25	143	740
testclass10.so	10	45	163	1360
testclass20.so	20	85	203	2600
testclass50.so	50	205	323	6320
testclass500.so	500	2005	2123	62120

These counts illustrate how both methods work:

Program objprelink1 changes the virtual table relocation records from type R_386_32 to type R_386_RELATIVE. It also introduces one new R_386_32 relocation record for each distinct symbol referenced by the virtual table entries.
Program objprelink2 also changes the virtual table relocation records from type R_386_32 to type R_386_RELATIVE. It introduces a new R_386_JUMP_SLOT lazy relocation record for each distinct symbol referenced by the virtual table entries. It also introduces additional R_386_RELATIVE relocation record for fixing the PLT stubs.

5.3 - Execution times

The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared libraries testclassN.so. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.

Four cases are considered, depending on whether the Qt library and/or the test shared library were compiled using the default settings or using objprelink2. The combreloc option and the symbol resolution cache (see section 4.2) were enabled in all experiments.

Compared execution times (milliseconds).

libqt-mt.so testclass*.so	default default	default objprelink2	objprelink2 default	objprelink2 objprelink2
testclass1.so	93	92	47	46
testclass2.so	93	94	47	46
testclass5.so	94	92	47	46
testclass10.so	93	92	48	46
testclass20.so	95	93	49	47
testclass50.so	100	97	52	49
testclass500.so	144	125	101	75

Lazy linking virtual table entries brings another 50% speedup over previous improvements. Compiling the test shared libraries brings a speedup that is roughly proportional to the number of classes in the shared library. Switching from a normal version to a version of Qt compiled with objprelink2 saves about 40 milliseconds.

5.4- Qt performance

The following table counts the number of relocation records of each type in the Qt-3.0.3 library. The number of upfront symbol resolutions has been reduced from 26823 to 3640.

Relocation record counts for libqt-mt.so.3.0.3.

	R_386_32	R_386_GLOB_DAT	R_386_JUMP_SLOT	R_386_RELATIVE
default	24571	2252	6916	11356
objprelink2	1388	2252	12339	40750

The following table indicates the sizes of various sections of the Qt-3.0.3 shared library file.

Section sizes for libqt-mt.so.3.0.3 (stripped).

	Total	.plt	.text	.{*}data	.got	.rel.{*}
default	6660K	110K	4147K	1010K	37K	361K
objprelink2	6939K	198K	4209K	1055K	58K	454K

Several size increases can be observed:

The size of both the PLT and the GOT increases in order to accommodate one additional PLT stubs for each distinct symbol referenced by virtual table entries.
The size of the REL sections increases in order to accommodate the additional relocation records required to fix the PLT stub code.
The size of the TEXT section increases because of technical reasons related to the objprelink2 implementation.
The size of the RODATA section increases significantly because our version of the GNU link editor stubbornly refuses to merge identical read-only constants when performing the incremental link edition needed by objprelink2.

An optimal implementation of the objprelink2 ideas would not change the size of either the TEXT or RODATA section. Details can be found here.

5.5- KDE performance

Our final test consists in compiling all the KDE3 desktop environment using objprelink2. Libraries compiled with objprelink2 seem to perform as flawlessly as the originals on our test machine, a 600 MHz Pentium III laptop running Red Hat Linux 7.2.

The following printout shows typical statistics returned by the runtime linker when running the web browser konqueror, before and after recompiling Qt and KDE3 with objprelink2.

Using combreloc only:

[leonb@fuji src]$ LD_DEBUG=statistics konqueror
32013:  runtime linker statistics:
32013:    total startup time in dynamic loader: 159462206 clock cycles
32013:              time needed for relocation: 156734640 clock cycles (98.2%)
32013:                   number of relocations: 22312
32013:        number of relocations from cache: 46825
32013:             time needed to load objects: 2506857 clock cycles (1.5%)

After recompiling Qt and KDE3 with objprelink2:

[leonb@fuji leonb]$ LD_DEBUG=statistics konqueror
28630:  runtime linker statistics:
28630:    total startup time in dynamic loader: 73010700 clock cycles
28630:              time needed for relocation: 70757732 clock cycles (96.9%)
28630:                   number of relocations: 8201
28630:        number of relocations from cache: 4414
28630:             time needed to load objects: 2014055 clock cycles (2.7%)

Multiple runs can return very different results for the time needed to load objects. The results for the time needed for relocation are very consistent. The numbers of relocations, of course, never change across multiple runs.

Recompiling both Qt and KDE3 with objprelink2 reduces the time needed for relocation by one half compared to combreloc alone. This represents about 150 milliseconds on our test machine. This result however does not account for the time needed to perform the lazy symbol resolutions.

A more interesting measurement evaluates the delay between the moment konqueror is launched and the moment konqueror starts waiting for user input. This startup time can be evaluated by tracing well chosen system calls using the program strace. The following table reports the konqueror startup times with different setups for lazy symbol resolution:

Konqueror startup times (seconds)

No lazy resolution, (using `LD_BIND_NOW=1`)	3.07
Normal lazy resolution, (using `combreloc`).	2.80
Normal lazy resolution, Qt compiled with `objprelink2`.	2.78
Normal lazy resolution, Qt and KDE3 compiled with `objprelink2`.	2.66

The konqueror startup time decreases by about 140 milliseconds when both Qt and KDE3 are recompiled using objprelink2. This is close to the results obtained from the runtime linker statistics. It also demonstrates that a large improvement in runtime linking speed no longer translates into obvious improvements in application startup times.

Only recompiling Qt does not bring significant speedups. This result probably means that a large proportion of Qt virtual table symbols need to be resolved during the application startup.

5.6- Summary

Program objprelink2 still reduces the runtime linking time by about 50%. This reduction comes on top of the more significant speedups achieved by either combreloc or objprelink1.
The runtime linking time now represents only a small part of the KDE3 application startup time. Large speedups are to be found elsewhere.

Follow this link to replicate.

6- Comparison with the ELF pre-linker.

6.1- Pre-linker overview

Another way to improve the runtime linking consists in pre-relocating and pre-linking shared libraries. This approach is very nicely illustrated by Jelinek's ELF pre-linker [Jelinek-2001].

Pre-relocating a shared library means selecting a preferred memory address for loading the shared library, and pre-patching the shared library file for execution at this selected address.
Pre-linking a shared library means identifying which other shared libraries are needed by this library, and pre-patching the shared library file by assuming that the needed shared libraries will be loaded at their selected pre-relocation addresses.

Pre-linking involves an additional subtlety. The ELF document specifies that undefined symbols in shared libraries must be first searched in the main executable program, then searched in the needed shared libraries. It is therefore necessary to record potential conflicts in the executable file, that is to say, a list of symbols defined by the main executable that override symbols defined by shared libraries.

When a program starts, the runtime linker checks (a) that the executable program file contains an up-to-date conflict list, (b) that all the needed shared libraries have been pre-relocated and pre-linked, (c) that all the shared libraries can be loaded at their selected pre-relocation address, and (d) that no shared library has been modified since pre-linking object files that depend on them. When all conditions are verified, the runtime linker needs only apply the few patches described by the conflict list. This is obviously a very fast operation since it mostly consists in doing nothing. It also maximizes the number of virtual memory pages that can be shared between several instances of the same shared library running in different processes.

The price of this performance is a reduced flexibility. The runtime linker must revert to the usual processing as soon as one of the above condition is not met. Upgrading a minor shared library in the system is very likely to compromise the runtime linking speed of a large number of programs. The only way to restore the optimal speed then consists of running the pre-linker on all executables and all shared libraries in the system. This is a potentially expensive operation.

6.2- Suggestion

The results reported in this paper suggest a way to obtain nearly optimal runtime linking speed with a much smaller cost in flexibility.

Shared libraries could be pre-relocated but not pre-linked. This would eliminate the pre-linking dependencies between required shared libraries and eliminate the need for maintaining up-to-date conflict lists. Yet it would deliver optimal virtual memory page sharing for all shared libraries than can be loaded at their selected pre-relocation address.
Lazy symbol resolution can be arranged as suggested by the objprelink2 program. Extensive lazy symbol resolution would considerably reduce the runtime linking cost of not performing the pre-linking step.

7- Conclusion.

The runtime linking properties of C++ programs has been described. Complexity reduction methods such as objprelink1 and combreloc have been compared and shown to be equivalent. Further speedups using lazy resolution of virtual table symbols have been demonstrated using objprelink2. Comparing these methods with the full ELF pre-linker has lead to suggesting a promising hybrid approach.

References

[ELF-1995]
TIS Committee: Tool Interface Standard (TIS) - Executable and Linking Format (ELF) Specification - version 1.2, May 1995, http://www.linuxbase.org/spec/refspecs/elf.pdf.

[Bastian-2001]
Waldo Bastian: Making C++ ready for the desktop, May 2001, http://www.suse.de/~bastian/Export/linking.txt.

[Jelinek-2001]
Jakub Jelinek: ELF prelinking, June 2001, http://sources.redhat.com/ml/libc-alpha/2001-06/msg00113.html. Additional info here.

[Bottou-2001]
Leon Bottou: Faster startups by fixing C++ object files, July 2001, http://objprelink.sourceforge.net/index-1.html. Additional info here.

[Jelinek-2001-b]
Jakub Jelinek: Support -z combreloc in binutils, August-2001, http://sources.redhat.com/ml/binutils/2001-08/msg00311.html.

R_386_RELATIVE	These records describe patches than only depend on the loading address of the shared library and therefore require no symbol resolution.
R_386_32	These records describe patches that require a symbol resolution when the program starts up. Virtual table entries are usually constructed using these relocation records.
R_386_JUMP_SLOT	These records describe GOT patches that involve a lazy symbol resolution. These relocation records are not processed when the program starts up, but when the corresponding function is first invoked.
R_386_GLOB_DATA	These records describe GOT patches that require a symbol resolution when the program starts up. These records are listed for reference and remained unchanged.

Faster C++ program startups by improving runtime linking efficiency.