C++: A jump table with a template device

Tuesday, 5 May 2015

C++: A jump table with a template device

A few articles ago we saw how gcc might need some help when mixing template instanciation (pure compile time data) with function calls (deducible compile time information, but not available to the template expander). Now we'll go one step further and combine all three types: pure compile time data, deducible compile time data and pure run time data (*). Just to annoy the compiler, and to see how gcc is able to optimize the results.

Let's build a simple example, similar to what we used last time: an object that will determine the range of an integer and then invoke a callback with the closest range. Something like this could be used, for example, to allocate a buffer.

void boring(int x, func f) {
    if (x < 2) {
        f(2);
    } else if (x < 4) {
        f(4);
    } else if (x < 8) {
        f(8);
    } else if (x < 16) {
        // You get the idea...
    }
}

Can we build a prettier template version of this code, without any overhead? Let's try:

typedef void (*func)(int);

template <int My_Size>
struct Foo {
    void bar(size_t size, func callback) {
        if (size > My_Size) {
            callback(My_Size);
        } else {
            next_foo.bar(size, callback);
        }
    }

    Foo<My_Size/2> next_foo;
};

// Stop condition
template<> struct Foo<0> {
    void bar(size_t, func) { }
};

void wrapper(int x, func f) {
    Foo<512> jump_table;
    jump_table.bar(x, f);
}

And now, let's compile like as "g++ -fverbose-asm -S -O0 -c foo.cpp -o /dev/stdout | c++filt". You'll see something like this:

wrapper(int, void (*)(int)):
    call    Foo<512>::bar(unsigned long, void (*)(int))

Foo<512>::bar(unsigned long, void (*)(int)):
    cmpq    $512, %rsi    #, size
    jbe    .L4
    call    *%rdx    # callback
    jmp    .L3
.L4:
    call    Foo<256>::bar(unsigned long, void (*)(int))    #
.L3:
    leave

Foo<256>::bar(unsigned long, void (*)(int)):
    cmpq    $256, %rsi    #, size
    jbe    .L4
    call    *%rdx    # callback
    jmp    .L3
.L4:
    call    Foo<128>::bar(unsigned long, void (*)(int))    #
.L3:
    leave

# You get the idea, right?

Foo<0>::bar(unsigned long, void (*)(int)):
    # Stop condition, do nothing

That doesn't look too good, does it? We don't need to worry: we already learned that gcc needs help from the optimizer to handle template expansion and non static function calls. Let's move to O1:

rapper(int, void (*)(int)):
.LFB14:
    cmpq    $512, %rdi    #, D.2974
    jbe    .L2    #,
    movl    $512, %edi    #,
    call    *%rsi    # f
    jmp    .L1    #
.L2:
    cmpq    $256, %rdi    #, D.2974
    jbe    .L4    #,
    movl    $256, %edi    #,
    call    *%rsi    # f
    jmp    .L1    #

# Again, it should be clear what's going on...

.L11:
    cmpq    $1, %rdi    #, D.2974
    .p2align 4,,2
    jbe    .L1    #,
    movl    $1, %edi    #,
    .p2align 4,,2
    call    *%rsi    # f
.L1:

It's better than last time, but it doesn't look great either: gcc managed to inline all calls, but it stopped there. Let's move to O2 and see what happens:


wrapper(int, void (*)(int)):
    movslq    %edi, %rdi    # x, D.2987
    cmpq    $512, %rdi    #, D.2987
    ja    .L13    #,
    cmpq    $256, %rdi    #, D.2987
    ja    .L14    #,
    [ .... ]
    cmpq    $2, %rdi    #, D.2987
    ja    .L21    #,

.L13:
    movl    $512, %edi    #,
    jmp    *%rsi    # f

.L14:
    movl    $256, %edi    #,
    jmp    *%rsi    # f

[ .... ]

.L21:
    movl    $2, %edi    #,
    jmp    *%rsi    # f

.L1:
    rep
    ret
    .p2align 4,,10
    .p2align 3

Now, that looks much better. And we can now see that gcc generates the same code at -O2 for both versions of our code.

(*) Just for the sake of completion:

Pure compile time data is information directly available during compilation time, like a constant.
Deducible compile time data means something that can easily be deduced, like a function call to a non virtual method.
Run-time only data means something that a compiler could never deduce, like a volatile variable or the parameter of a function called from outside the current translation unit.

8 comments:

ploxiln6 May 2015 at 05:33
Maybe I'm being dumb, but I don't think

if (size > My_Size) {
callback(My_Size);

is the same logic as boring(). it would have to be something like

if (size >= My_Size/2) {
callback(My_Size);

(or refactored more significantly)
ReplyDelete
Replies
Anonymous6 May 2015 at 08:11
This is not a jump table --- the resulting code still contains a chain of conditional jumps. A jump table uses an array of function pointers and a given value is used as an index into this array. In general, your approach will need N comparisons for N if cases whereas a true jump table will just do a one memory access and a jump to a register value regardless of a number of cases.
ReplyDelete
Replies
Rob G6 May 2015 at 13:33
I really don't see how anyone sane could regard the template version as "prettier".
ReplyDelete
Replies
robdesbois7 May 2015 at 10:20
For interest's sake: is there a particular reason to prefer power-of-2 buffer sizes over exponentially growing from a non-power? I.e. if initial x is 3, is it really better to calculate the next power so you can allocate 4,8,16, 32... than just 3,6,12,24...?
ReplyDelete
Replies
ploxiln9 May 2015 at 00:00
Re: why power-of-2:

A natural checkpoint in the sizing is 4KiB (which is a power of 2), because that's a memory page size (for most architectures) (because they can just mask the lower 12 bits to identify the page). The OS actually allocates memory to a process at the finest granularity of 4KiB. That's also a good point to switch allocation strategies, maybe use mmap() to get a stand-alone chunk of memory, instead of something like sbrk() to extend the old-style heap region.
ReplyDelete
Replies
nicolasbrailo27 May 2015 at 11:46
Nothing more to add to ploxiln's comment. I initially worked on this snippet because I had to do some bucketing similar to what jemalloc does (I'm not sure how jemalloc implements their buckets, though)
ReplyDelete
Replies
nicolasbrailo27 May 2015 at 11:49
It is very much a matter of personal opinion, but your point of view can change rapidly once you discover that someone typo'd a 32 for a 23 and you had to spend a day trying to figure out why some of the buckets are broken :)
ReplyDelete
Replies
robdesbois27 May 2015 at 13:03
Ahh that makes sense, thanks for the info.
ReplyDelete
Replies

Add comment

Infinite monkey - Nico Brailovsky's blog

syntax highlight

Tuesday, 5 May 2015

C++: A jump table with a template device

8 comments: