Léon Bottou
AT&T Labs - Research, Middletown, NJ, USA
Current address: NEC Research Institute, Princeton, NJ, USA
http://leon.bottou.com
John Ryland
Trolltech AS, Oslo, Norway
mailto:jryland@trolltech.com
This paper describes runtime linking speed issues arising with shared libraries written in C++. Two methods for reducing the runtime linking complexity, namely objprelink1 and combreloc, are compared and shown to be equivalent. A third method, namely objprelink2, yields additional speedups by resolving virtual table entries on demand. These methods are then compared with the full pre-linking approach. A promising hybrid is suggested.
Shared libraries are a common feature of all modern operating systems. Executable program no longer incorporate the code of common library functions such as printf. Instead the executable program contains unresolved references to the library functions and an additional piece of information describing a shared library file(s) containing the missing functions. Shared libraries themselves can depend on additional shared libraries.
When the program starts up, the runtime linker loads the necessary shared libraries, resolves the undefined symbols, and eventually transfers control to the program itself.
The runtime linker performs the following operations:
The required patches are entirely described by relocation records located in the shared library file itself. Patches must be applied independently in each process using a given shared library. Virtual memory page affected by the patches no longer can be shared between processes using the same library. Resolving symbols also requires a potentially expensive search in the list of all symbols defined by all shared libraries.
The Linux system, and more generally all operating systems using the ELF object file format, use several strategies to reduce the cost of these operations [ELF-1995]:
Although these strategies are fairly successful on usual C programs, they do not address significant cases. A C shared library, for instance, may define the following global variables:
char *a = "abcdef"; double (*b)(double) = &cos;
The semantics of the C language implies that both variables a and b must contain absolute pointers to either the string "abcdef" or the library function cos(). Both variables will require a patch when the shared library will be used.
Remark:
It is interesting to notice that the example above would almost work if
variable b contained the address of a PLT entry for function
cos() instead of the absolute address of function cos().
The shared library that initializes variable b can indeed call the
cosine function via the function pointer. Calling it from another shared
library does not work: the GOT register contains a pointer to the other shared
library, while the PLT stub expects that it contains a pointer to the shared
library that initializes b.
It is unfortunate that the C++ language relies a lot more on
constructs that are poorly handled by the runtime linking
speedup strategies discussed above.
Waldo Bastian [Bastian-2001] has investigated the run-time
linking performance of various components of the KDE desktop
environment (a large collection of C++ shared libraries and
executable). He pinpoints the performance issue to the
resolution of the virtual table entries:
In his own words:
"
Is this a KDE specific problem? That depends on how you look at it. The
problem of long startup times is caused by the linker needing to do many
relocations. In the case of KDE, these relocations are for a great deal caused
by vtable entries. KDE and Qt make liberal use of virtual functions and
inheritance. In general every class that inherits from a base class with
virtual functions will need to provide a vtable that covers at least the
virtual functions of this base class. It seems that each vtable entry causes a
relocation, and worse, a relocation that can not be done lazy.
"
Waldo Bastian suggests two strategies to improve the runtime linking
performance: (a) improving the efficiency of C++ virtual table
linking, and (b) pre-relocation and pre-linking of shared libraries.
This paper explores the first strategy in section 4
and 5. Jelinek's ELF pre-linker is briefly reviewed
in section 6 in order to discuss the pros and cons of
both approaches.
Each virtual table entry requires a separate relocation record. The runtime
linker processes each relocation record by performing one symbol resolution
and applying the corresponding patch.
In a large class hierarchy, the number of distinct symbols referenced by the
virtual table entries is usually much lower than the number of virtual table
entries themselves. All the derived class virtual tables to a large extent
replicate the virtual tables of their base classes. For instance, almost
every class in the KDE C++ code inherits about one hundred virtual functions
from class QWidget. Every corresponding virtual table contains about
one hundred identical entries. Each of these virtual table entries
is separately resolved by the runtime linker.
The objprelink-1 program [Bottou-2001] provides
a first method for avoiding this waste of CPU cycles. Virtual table entries
now point to a relay stub jumping to the target function. All virtual table
entries addressing the same function use the same relay stub.
There is still a relocation record per virtual table entry, because each
virtual table entry contains the absolute address of a relay stub. The
runtime linker computes this address without performing a symbol
resolution, merely by adding a displacement to the base address of the shared
library. The relay stub, on the other hand, requires a relocation record
involving a symbol resolution. But the number of relay stubs is much lower
than the number of virtual table entries.
The program objprelink-1 (object file pre-linker) achieves this result
by post-processing the object files produced by the compiler in a way that
coerces the GNU link editor ld into producing the modified shared
libraries. Measurements on various KDE2 applications [Bottou-2001] have shown a significant reduction of the
runtime linking time. The perceptual responsiveness of the KDE2 desktop
environment was significantly improved by using objprelink1.
Recent versions of the Linux operating
system provide a similar speedup using a scheme initially implemented in
the Solaris 7 operating system. The method relies on modifications in both the
link editor and the runtime linker [Jelinek-2001b].
The performance of both methods has been evaluated using the
small benchmark program proposed in [Bastian-2001].
A collection of shared libraries of increasing complexity
are compiled from the following source code:
Each shared library testclassN.so contains N class
definition. Each class inherits about one hundred virtual member functions
from the QWidget class defined by Trolltech Qt-3.0.3 library.
The following table compares the execution times of C++ programs composed of
an empty main function linked with the test shared libraries.
Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop
computer running Red Hat Linux 7.2.
Four cases are considered according to the presence of option
-z combreloc and to the use of the object pre-linker
objprelink1. All tests use the same source code, the same compiler,
and the same version of the Qt library. Execution time differences are
therefore caused by differences in the runtime linking time of the
test shared libraries.
3- Runtime linking and C++.
4- Reducing the complexity.
4.1- objprelink-1
4.2- combreloc
The presence of both modifications ensures that relocation records involving
the same symbol resolution require only one execution of the actual symbol
resolution code.
4.3- Performance
#include <qwidget.h>
template<int T> struct testclass : public QWidget {
virtual void setSizeIncrement(int w, int h)
{ QWidget::setSizeIncrement(w+T, h+T); }
};
template class testclass<1>;
...
template class testclass<N>;
testclass*.so | nocombreloc | combreloc | objprelink1+ nocombreloc | objprelink1+ combreloc |
---|---|---|---|---|
testclass1.so | 46 | 47 | 45 | 45 |
testclass2.so | 47 | 47 | 46 | 47 |
testclass5.so | 48 | 47 | 47 | 48 |
testclass10.so | 50 | 48 | 48 | 48 |
testclass20.so | 54 | 49 | 49 | 48 |
testclass50.so | 66 | 52 | 52 | 52 |
testclass500.so | 275 | 101 | 100 | 91 |
Both the objprelink1 and combreloc methods improve the runtime linking time by reducing the number of actual symbol resolutions. Further improvements can be achieved by providing lazy symbol resolution for virtual table entries.
Lazy symbol resolution improves the responsiveness of interactive programs because the cost of symbol resolution is distributed over the duration of the program execution. It also improves the overall runtime linking time because few programs actually use all the functions provided by the needed shared libraries. On our test system (Red Hat Linux 7.2 with KDE3), the konqueror web browser and its needed shared libraries reference 40021 distinct symbols in their relocation records. A typical session however only uses 17919 distinct symbols. More interestingly, the konqueror window appears after using only 13214 distinct symbols.
It seems very easy to obtain lazy symbol resolution for virtual table entries. We must first create a PLT entry for each distinct symbol in the virtual tables, and have each virtual table entry point to the corresponding PLT entry stub. The first call to a virtual function will invoke the runtime linker. Subsequent calls will directly jump to the target function.
Yet a difficult problem remains. As discussed by a remark in section 2, the PLT stub code expects that the got pointer register contains a pointer to the GOT of the shared library containing the PLT entry. This is true when the virtual function is called from code contained in this same shared library. This is no longer true when the virtual function is called from the executable program or from another shared library. Such a call usually causes a program crash.
Program objprelink2 solves the problem by inserting fake relocation records that cause the runtime linker to patch the relevant PLT entries and make them work regardless of the contents of the got pointer register.
The standard i386 stub code for a shared library PLT is laid out according to the following template. The got pointer register is named %ebx.
PLT0: pushl 4(%ebx) // Code for entering runtime linker jmp *8(%ebx) // Code for entering runtime linker .... // Beginning of PLT stub i. PLTi: jmp *offset_of_got_entry(%ebx) // This GOT entry initially points // to the following instruction. pushl $relocation_record_for_got_entry jmp PLT0 // This jump enters the runtime linker code that // updates the GOT entry and calls the target. // Subsequent calls to this PLT entry will // then jump directly to the target function.
Program objprelink2 redefines the relevant PLT entries according to the following model.
PLT0: pushl GOT+4 jmp *GOT+8 .... PLTi: jmp *GOT+offset_of_got_entry pushl $relocation_record_for_got_entry jmp PLT0
References to the got pointer register have been replaced by the absolute address of the relevant GOT entries. Such a construct is normally used in executable programs. Since shared libraries can be loaded at variable addresses, objprelink2 creates additional relocation records that will adjust these absolute addresses according to the actual loading address of the shared library.
The program objprelink2 performs these modifications by preprocessing the object files and post processing the shared library. Additional details can be found here. It would be considerably more elegant and more efficient to perform these operations inside the GNU link editor ld. Yet objprelink2 is sufficient for verifying the concept and measuring the resulting performance.
The performance of the objprelink2 method can be measured using the test shared libraries testclassN.so presented in section 4.3. A first measurement counts the number of relocation records of each type in the test shared libraries testclassN.so. Four record types are relevant for Intel processors:
R_386_RELATIVE | These records describe patches than only depend on the loading address of the shared library and therefore require no symbol resolution. |
R_386_32 | These records describe patches that require a symbol resolution when the program starts up. Virtual table entries are usually constructed using these relocation records. |
R_386_JUMP_SLOT | These records describe GOT patches that involve a lazy symbol resolution. These relocation records are not processed when the program starts up, but when the corresponding function is first invoked. |
R_386_GLOB_DATA | These records describe GOT patches that require a symbol resolution when the program starts up. These records are listed for reference and remained unchanged. |
The following tables present the relocation record counts for the test shared libraries compiled as usual, using objprelink1, or using objprelink2.
R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | |
---|---|---|---|---|
testclass1.so | 121 | 9 | 8 | 3 |
testclass2.so | 242 | 13 | 8 | 3 |
testclass5.so | 605 | 25 | 8 | 3 |
testclass10.so | 1210 | 45 | 8 | 3 |
testclass20.so | 2420 | 85 | 8 | 3 |
testclass50.so | 6050 | 205 | 8 | 3 |
testclass500.so | 60500 | 2005 | 8 | 3 |
R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | |
---|---|---|---|---|
testclass1.so | 120 | 9 | 8 | 123 |
testclass2.so | 125 | 13 | 8 | 243 |
testclass5.so | 140 | 25 | 8 | 603 |
testclass10.so | 165 | 45 | 8 | 1203 |
testclass20.so | 215 | 85 | 8 | 2403 |
testclass50.so | 365 | 205 | 8 | 6003 |
testclass500.so | 2615 | 2005 | 8 | 60003 |
R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | |
---|---|---|---|---|
testclass1.so | 1 | 9 | 127 | 244 |
testclass2.so | 2 | 13 | 131 | 368 |
testclass5.so | 5 | 25 | 143 | 740 |
testclass10.so | 10 | 45 | 163 | 1360 |
testclass20.so | 20 | 85 | 203 | 2600 |
testclass50.so | 50 | 205 | 323 | 6320 |
testclass500.so | 500 | 2005 | 2123 | 62120 |
These counts illustrate how both methods work:
The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared libraries testclassN.so. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.
Four cases are considered, depending on whether the Qt library and/or the test shared library were compiled using the default settings or using objprelink2. The combreloc option and the symbol resolution cache (see section 4.2) were enabled in all experiments.
libqt-mt.so testclass*.so | default default | default objprelink2 | objprelink2 default | objprelink2 objprelink2 |
---|---|---|---|---|
testclass1.so | 93 | 92 | 47 | 46 |
testclass2.so | 93 | 94 | 47 | 46 |
testclass5.so | 94 | 92 | 47 | 46 |
testclass10.so | 93 | 92 | 48 | 46 |
testclass20.so | 95 | 93 | 49 | 47 |
testclass50.so | 100 | 97 | 52 | 49 |
testclass500.so | 144 | 125 | 101 | 75 |
Lazy linking virtual table entries brings another 50% speedup over previous improvements. Compiling the test shared libraries brings a speedup that is roughly proportional to the number of classes in the shared library. Switching from a normal version to a version of Qt compiled with objprelink2 saves about 40 milliseconds.
The following table counts the number of relocation records of each type in the Qt-3.0.3 library. The number of upfront symbol resolutions has been reduced from 26823 to 3640.
R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | |
---|---|---|---|---|
default | 24571 | 2252 | 6916 | 11356 |
objprelink2 | 1388 | 2252 | 12339 | 40750 |
The following table indicates the sizes of various sections of the Qt-3.0.3 shared library file.
Total | .plt | .text | .{*}data | .got | .rel.{*} | |
---|---|---|---|---|---|---|
default | 6660K | 110K | 4147K | 1010K | 37K | 361K |
objprelink2 | 6939K | 198K | 4209K | 1055K | 58K | 454K |
Several size increases can be observed:
Our final test consists in compiling all the KDE3 desktop environment using objprelink2. Libraries compiled with objprelink2 seem to perform as flawlessly as the originals on our test machine, a 600 MHz Pentium III laptop running Red Hat Linux 7.2.
The following printout shows typical statistics returned by the runtime linker when running the web browser konqueror, before and after recompiling Qt and KDE3 with objprelink2.
[leonb@fuji src]$ LD_DEBUG=statistics konqueror 32013: runtime linker statistics: 32013: total startup time in dynamic loader: 159462206 clock cycles 32013: time needed for relocation: 156734640 clock cycles (98.2%) 32013: number of relocations: 22312 32013: number of relocations from cache: 46825 32013: time needed to load objects: 2506857 clock cycles (1.5%)
[leonb@fuji leonb]$ LD_DEBUG=statistics konqueror 28630: runtime linker statistics: 28630: total startup time in dynamic loader: 73010700 clock cycles 28630: time needed for relocation: 70757732 clock cycles (96.9%) 28630: number of relocations: 8201 28630: number of relocations from cache: 4414 28630: time needed to load objects: 2014055 clock cycles (2.7%)
Multiple runs can return very different results for the time needed to load objects. The results for the time needed for relocation are very consistent. The numbers of relocations, of course, never change across multiple runs.
Recompiling both Qt and KDE3 with objprelink2 reduces the time needed for relocation by one half compared to combreloc alone. This represents about 150 milliseconds on our test machine. This result however does not account for the time needed to perform the lazy symbol resolutions.
A more interesting measurement evaluates the delay between the moment konqueror is launched and the moment konqueror starts waiting for user input. This startup time can be evaluated by tracing well chosen system calls using the program strace. The following table reports the konqueror startup times with different setups for lazy symbol resolution:
No lazy resolution, (using LD_BIND_NOW=1) |
3.07 |
Normal lazy resolution, (using combreloc). |
2.80 |
Normal lazy resolution, Qt compiled with objprelink2. |
2.78 |
Normal lazy resolution, Qt and KDE3 compiled with objprelink2. |
2.66 |
The konqueror startup time decreases by about 140 milliseconds when both Qt and KDE3 are recompiled using objprelink2. This is close to the results obtained from the runtime linker statistics. It also demonstrates that a large improvement in runtime linking speed no longer translates into obvious improvements in application startup times.
Only recompiling Qt does not bring significant speedups. This result probably means that a large proportion of Qt virtual table symbols need to be resolved during the application startup.
Follow this link to replicate.
Another way to improve the runtime linking consists in pre-relocating and pre-linking shared libraries. This approach is very nicely illustrated by Jelinek's ELF pre-linker [Jelinek-2001].
When a program starts, the runtime linker checks (a) that the executable program file contains an up-to-date conflict list, (b) that all the needed shared libraries have been pre-relocated and pre-linked, (c) that all the shared libraries can be loaded at their selected pre-relocation address, and (d) that no shared library has been modified since pre-linking object files that depend on them. When all conditions are verified, the runtime linker needs only apply the few patches described by the conflict list. This is obviously a very fast operation since it mostly consists in doing nothing. It also maximizes the number of virtual memory pages that can be shared between several instances of the same shared library running in different processes.
The price of this performance is a reduced flexibility. The runtime linker must revert to the usual processing as soon as one of the above condition is not met. Upgrading a minor shared library in the system is very likely to compromise the runtime linking speed of a large number of programs. The only way to restore the optimal speed then consists of running the pre-linker on all executables and all shared libraries in the system. This is a potentially expensive operation.
The results reported in this paper suggest a way to obtain nearly optimal runtime linking speed with a much smaller cost in flexibility.
The runtime linking properties of C++ programs has been described. Complexity reduction methods such as objprelink1 and combreloc have been compared and shown to be equivalent. Further speedups using lazy resolution of virtual table symbols have been demonstrated using objprelink2. Comparing these methods with the full ELF pre-linker has lead to suggesting a promising hybrid approach.
[ELF-1995]
TIS Committee: Tool Interface Standard (TIS) - Executable
and Linking Format (ELF) Specification - version 1.2,
May 1995, http://www.linuxbase.org/spec/refspecs/elf.pdf.
[Bastian-2001]
Waldo Bastian: Making C++ ready for the desktop,
May 2001, http://www.suse.de/~bastian/Export/linking.txt.
[Jelinek-2001]
Jakub Jelinek: ELF prelinking, June 2001, http://sources.redhat.com/ml/libc-alpha/2001-06/msg00113.html.
Additional info here.
[Bottou-2001]
Leon Bottou: Faster startups by fixing C++ object files,
July 2001, http://objprelink.sourceforge.net/index-1.html.
Additional info here.
[Jelinek-2001-b]
Jakub Jelinek: Support -z combreloc in binutils,
August-2001, http://sources.redhat.com/ml/binutils/2001-08/msg00311.html.