Let's build a simple example, similar to what we used last time: an object that determines which power-of-two range an integer falls in and then invokes a callback with the upper bound of that range. Something like this could be used, for example, to pick a buffer allocation size.
typedef void (*func)(int);

void boring(int x, func f) {
    if (x < 2) {
        f(2);
    } else if (x < 4) {
        f(4);
    } else if (x < 8) {
        f(8);
    } else if (x < 16) {
        f(16);
        // You get the idea...
    }
}
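Concretely, wiring boring() to a buffer-allocating callback might look like this (the recording global and allocate_buffer are illustrative, not part of the post; the elided branches are filled in so the sketch runs):

```cpp
#include <cstddef>

typedef void (*func)(int);

// Records which bucket the callback chose (illustrative only).
static int g_chosen = 0;

static void allocate_buffer(int size) {
    g_chosen = size;  // a real callback might malloc(size) here
}

void boring(int x, func f) {
    if (x < 2) {
        f(2);
    } else if (x < 4) {
        f(4);
    } else if (x < 8) {
        f(8);
    } else if (x < 16) {
        f(16);
    }
    // ...and so on for larger buckets
}
```

Calling `boring(3, allocate_buffer)` invokes the callback with 4, the upper bound of the range [2, 4) that 3 falls in.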
Can we build a prettier template version of this code, without any overhead? Let's try:
typedef void (*func)(int);

template <int My_Size>
struct Foo {
    void bar(size_t size, func callback) {
        if (size > My_Size) {
            callback(My_Size);
        } else {
            next_foo.bar(size, callback);
        }
    }
    Foo<My_Size/2> next_foo;
};

// Stop condition
template <> struct Foo<0> {
    void bar(size_t, func) { }
};

void wrapper(int x, func f) {
    Foo<512> jump_table;
    jump_table.bar(x, f);
}
And now, let's compile with "g++ -fverbose-asm -S -O0 -c foo.cpp -o /dev/stdout | c++filt". You'll see something like this:
wrapper(int, void (*)(int)):
call Foo<512>::bar(unsigned long, void (*)(int))
Foo<512>::bar(unsigned long, void (*)(int)):
cmpq $512, %rsi #, size
jbe .L4
call *%rdx # callback
jmp .L3
.L4:
call Foo<256>::bar(unsigned long, void (*)(int)) #
.L3:
leave
Foo<256>::bar(unsigned long, void (*)(int)):
cmpq $256, %rsi #, size
jbe .L4
call *%rdx # callback
jmp .L3
.L4:
call Foo<128>::bar(unsigned long, void (*)(int)) #
.L3:
leave
# You get the idea, right?
Foo<0>::bar(unsigned long, void (*)(int)):
# Stop condition, do nothing
That doesn't look too good, does it? We don't need to worry: we already learned that gcc needs help from the optimizer to handle template expansion and non-static function calls. Let's move to -O1:
wrapper(int, void (*)(int)):
.LFB14:
cmpq $512, %rdi #, D.2974
jbe .L2 #,
movl $512, %edi #,
call *%rsi # f
jmp .L1 #
.L2:
cmpq $256, %rdi #, D.2974
jbe .L4 #,
movl $256, %edi #,
call *%rsi # f
jmp .L1 #
# Again, it should be clear what's going on...
.L11:
cmpq $1, %rdi #, D.2974
.p2align 4,,2
jbe .L1 #,
movl $1, %edi #,
.p2align 4,,2
call *%rsi # f
.L1:
It's better than last time, but it still doesn't look great: gcc managed to inline all the calls, but it stopped there. Let's move to -O2 and see what happens:
wrapper(int, void (*)(int)):
movslq %edi, %rdi # x, D.2987
cmpq $512, %rdi #, D.2987
ja .L13 #,
cmpq $256, %rdi #, D.2987
ja .L14 #,
[ .... ]
cmpq $2, %rdi #, D.2987
ja .L21 #,
.L13:
movl $512, %edi #,
jmp *%rsi # f
.L14:
movl $256, %edi #,
jmp *%rsi # f
[ .... ]
.L21:
movl $2, %edi #,
jmp *%rsi # f
.L1:
rep
ret
.p2align 4,,10
.p2align 3
Now, that looks much better. And we can now see that gcc generates the same code at -O2 for both versions of our code.
(*) Just for the sake of completeness:
- Pure compile-time data is information directly available during compilation, like a constant.
- Deducible compile-time data is something the compiler can easily deduce, like a call to a non-virtual method.
- Run-time-only data is something the compiler could never deduce, like a volatile variable or the parameter of a function called from outside the current translation unit.
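A tiny sketch of the three categories (the variable and type names are mine):

```cpp
constexpr int pure = 42;           // pure compile-time data: a constant

struct Widget {
    int deducible() { return 7; }  // deducible: a non-virtual call the compiler can resolve
};

volatile int runtime_flag = 0;     // run-time only: must be re-read on every access

int example(int external_param) {  // a parameter passed in from another TU is run-time only too
    Widget w;
    return pure + w.deducible() + external_param + runtime_flag;
}
```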
Maybe I'm being dumb, but I don't think

if (size > My_Size) {
    callback(My_Size);

is the same logic as boring(). It would have to be something like

if (size >= My_Size/2) {
    callback(My_Size);

(or refactored more significantly)
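The commenter appears to be right: with the original condition, the callback receives the largest power of two *below* size, not the bucket boring() picks. A corrected sketch using the suggested condition (Foo2 and the recording helpers are my names; behavior matches boring() for size >= 1):

```cpp
#include <cstddef>

typedef void (*func)(int);

static int g_picked = 0;               // records the chosen bucket (for testing)
static void record(int s) { g_picked = s; }

template <int My_Size>
struct Foo2 {
    void bar(size_t size, func callback) {
        if (size >= My_Size / 2) {
            callback(My_Size);         // smallest power of two strictly above size
        } else {
            next_foo.bar(size, callback);
        }
    }
    Foo2<My_Size / 2> next_foo;
};

// Stop condition: no buckets below 2 are generated.
template <> struct Foo2<1> {
    void bar(size_t, func) { }
};
```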
This is not a jump table --- the resulting code still contains a chain of conditional jumps. A jump table uses an array of function pointers and a given value is used as an index into this array. In general, your approach will need N comparisons for N if cases whereas a true jump table will just do a one memory access and a jump to a register value regardless of a number of cases.
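To make the distinction concrete, here is what an actual jump table for the same bucketing could look like: the index is computed from the value's bit length (using gcc's __builtin_clzll), so dispatch is one table load plus one indirect call regardless of the number of buckets (a sketch; all names are mine, and the table is truncated at 16):

```cpp
#include <cstddef>

typedef void (*func)(int);

static int g_got = 0;                 // records the result (for testing)
static void rec(int s) { g_got = s; }

static void call_2(func f)  { f(2); }
static void call_4(func f)  { f(4); }
static void call_8(func f)  { f(8); }
static void call_16(func f) { f(16); }

// The jump table itself: an array of function pointers.
static void (*const table[])(func) = { call_2, call_4, call_8, call_16 };

void dispatch(size_t x, func f) {
    // For x in [2^k, 2^(k+1)) the bucket is 2^(k+1), so the index is k,
    // i.e. one less than the bit length of x.
    size_t idx = (x < 2) ? 0 : (size_t)(63 - __builtin_clzll(x));
    const size_t last = sizeof(table) / sizeof(table[0]) - 1;
    if (idx > last) idx = last;       // clamp: this table only covers up to 16
    table[idx](f);                    // one load + one indirect call
}
```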
I really don't see how anyone sane could regard the template version as "prettier".
ReplyDeleteFor interest's sake: is there a particular reason to prefer power-of-2 buffer sizes over exponentially growing from a non-power? I.e. if initial x is 3, is it really better to calculate the next power so you can allocate 4,8,16, 32... than just 3,6,12,24...?
Re: why power-of-2:
A natural checkpoint in the sizing is 4KiB (which is a power of 2), because that's a memory page size (for most architectures) (because they can just mask the lower 12 bits to identify the page). The OS actually allocates memory to a process at the finest granularity of 4KiB. That's also a good point to switch allocation strategies, maybe use mmap() to get a stand-alone chunk of memory, instead of something like sbrk() to extend the old-style heap region.
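ploxiln's point can be turned into code: below one page, round requests up to a power of two; past a page, round up to whole 4 KiB pages instead of doubling (a sketch; next_pow2 and bucket_size are my names):

```cpp
#include <cstddef>

// Round x up to the next power of two (for x >= 1).
static size_t next_pow2(size_t x) {
    size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Past one page, round to whole 4 KiB pages instead of doubling.
static size_t bucket_size(size_t request) {
    const size_t page = 4096;  // page size on most architectures, per the comment
    if (request <= page)
        return next_pow2(request ? request : 1);
    return (request + page - 1) & ~(page - 1);  // round up to a page multiple
}
```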
Nothing more to add to ploxiln's comment. I initially worked on this snippet because I had to do some bucketing similar to what jemalloc does (I'm not sure how jemalloc implements their buckets, though)
It is very much a matter of personal opinion, but your point of view can change rapidly once you discover that someone typo'd a 32 for a 23 and you had to spend a day trying to figure out why some of the buckets are broken :)
Ahh that makes sense, thanks for the info.