A
few articles ago we saw how gcc might need some help when mixing template instanciation (pure compile time data) with function calls (deducible compile time information, but not available to the template expander). Now we'll go one step further and combine all three types: pure compile time data, deducible compile time data and pure run time data (*). Just to annoy the compiler, and to see how gcc is able to optimize the results.
Let's build a simple example, similar to what we used last time: an object that will determine the range of an integer and then invoke a callback with the closest range. Something like this could be used, for example, to allocate a buffer.
void boring(int x, func f) {
if (x < 2) {
f(2);
} else if (x < 4) {
f(4);
} else if (x < 8) {
f(8);
} else if (x < 16) {
// You get the idea...
}
}
Can we build a prettier template version of this code, without any overhead? Let's try:
typedef void (*func)(int);
template <int My_Size>
struct Foo {
void bar(size_t size, func callback) {
if (size > My_Size) {
callback(My_Size);
} else {
next_foo.bar(size, callback);
}
}
Foo<My_Size/2> next_foo;
};
// Stop condition
template<> struct Foo<0> {
void bar(size_t, func) { }
};
void wrapper(int x, func f) {
Foo<512> jump_table;
jump_table.bar(x, f);
}
And now, let's compile like as "g++ -fverbose-asm -S -O0 -c foo.cpp -o /dev/stdout | c++filt". You'll see something like this:
wrapper(int, void (*)(int)):
call Foo<512>::bar(unsigned long, void (*)(int))
Foo<512>::bar(unsigned long, void (*)(int)):
cmpq $512, %rsi #, size
jbe .L4
call *%rdx # callback
jmp .L3
.L4:
call Foo<256>::bar(unsigned long, void (*)(int)) #
.L3:
leave
Foo<256>::bar(unsigned long, void (*)(int)):
cmpq $256, %rsi #, size
jbe .L4
call *%rdx # callback
jmp .L3
.L4:
call Foo<128>::bar(unsigned long, void (*)(int)) #
.L3:
leave
# You get the idea, right?
Foo<0>::bar(unsigned long, void (*)(int)):
# Stop condition, do nothing
That doesn't look too good, does it? We don't need to worry: we already learned that gcc needs help from the optimizer to handle template expansion and non static function calls. Let's move to O1:
rapper(int, void (*)(int)):
.LFB14:
cmpq $512, %rdi #, D.2974
jbe .L2 #,
movl $512, %edi #,
call *%rsi # f
jmp .L1 #
.L2:
cmpq $256, %rdi #, D.2974
jbe .L4 #,
movl $256, %edi #,
call *%rsi # f
jmp .L1 #
# Again, it should be clear what's going on...
.L11:
cmpq $1, %rdi #, D.2974
.p2align 4,,2
jbe .L1 #,
movl $1, %edi #,
.p2align 4,,2
call *%rsi # f
.L1:
It's better than last time, but it doesn't look great either: gcc managed to inline all calls, but it stopped there. Let's move to O2 and see what happens:
wrapper(int, void (*)(int)):
movslq %edi, %rdi # x, D.2987
cmpq $512, %rdi #, D.2987
ja .L13 #,
cmpq $256, %rdi #, D.2987
ja .L14 #,
[ .... ]
cmpq $2, %rdi #, D.2987
ja .L21 #,
.L13:
movl $512, %edi #,
jmp *%rsi # f
.L14:
movl $256, %edi #,
jmp *%rsi # f
[ .... ]
.L21:
movl $2, %edi #,
jmp *%rsi # f
.L1:
rep
ret
.p2align 4,,10
.p2align 3
Now, that looks much better. And we can now see that gcc generates the same code at -O2 for both versions of our code.
(*) Just for the sake of completion:
- Pure compile time data is information directly available during compilation time, like a constant.
- Deducible compile time data means something that can easily be deduced, like a function call to a non virtual method.
- Run-time only data means something that a compiler could never deduce, like a volatile variable or the parameter of a function called from outside the current translation unit.