C's Curious Incapability

Suppose I have a service that controls lights. It's attached to a common struct interface, like this:

typedef struct light_controller light_controller_t;  
struct light_controller {  
    void (*switch)(light_controller_t *self, bool on);
}

I use it like this:

light_controller_t *lights = ...  
lights->switch(lights, false);

And it's implemented like this:

static void lights_switch_impl(light_controller_t *self, bool on) {  
    self->room->lit = on;
}

...

void lights_controller_default_init(lights_controller_t *self) {  
    self->switch = lights_switch_impl;
}

Neat. Okay. Function pointers in C are powerful stuff -- powerful enough that C++ was originally a preprocessor, not a language unto itself.

Here's the thing I want to do. Imagine lights is part of a hierarchy. We have a light_controller_t for a whole building, and one for every room. We manage this hierarchy separately. When we turn off a parent light controller, we want all the child lights to be explicitly switched off.

This is easy. We implement the switch function to do its own work, and call the function for any child.

static void lights_switch_impl(light_controller_t *self, bool on) {  
    self->room->lit = on;
    for (light_controller_t *child = self->child; child; child = child->next) {
        child->switch(child, on);
    }
}

Suppose then that we implement a lot of services, controlling windows, radios, ovens, etc. There are a lot of function calls here. Some parts of the hierarchy won't have some services (e.g. first_floor.services[oven_controller]) but will have children that do (i.e. first_floor.children[kitchen].services[oven_controller]).

You can think of this as a tree of composed service nodes. Each node in the tree is a service that may or may not implement certain interfaces. The little services might be pretty simple.

But what do we do when a node doesn't implement a service? We want child nodes implementing the service to receive the invocation. Something like this (pseudocode):

function service_node.get_service(service_type):  
    if (self implements service_type)
        return self as service_type
    else
        return service_proxy(self, service_type) as service_type
    end
end

function service_proxy(parent, service_type):  
    self = new service_type()
    for (method in service_type.methods)
        method = function(args...):
            for (child in parent.children)
                child.method(args)
            end
        end
    end
    return self
end

Now, if you know something about C, you know that there isn't any easy way to do this. For starters, the language features aren't present: we've got a closure we can work around, but we've also got something that looks like varargs but can't be, because varargs in C are a dead end that can't generically "splat" back to arguments.

"But Justin," you say (startling me), "this is a clear case of using the wrong language for the task." And that is certainly true for the task I have described. Rejoice, Rubyists (or "Rubes," as they are popularly known):

class Proxy  
  def initialize(owner)
    @owner = owner
  end

  private
    def method_missing(method, *args, &block)
      @owner.children.each { |child|
        child.send(method, *args, &block)
      }
    end
end

I know. I know! Awesome! (Other languages can do this too; for example, PHP has __call, which does the same job but provides 96% less smug satisfaction).

The problem is that we're dealing with an isolated hypothetical, and while Ruby may or may not be 67 times slower than C, it sure isn't faster. The real-life version of the problem I'm describing has to do this resolution thousands of times per second at a speed that won't be more than a blip in profiling, or I'll have to rip it out. Dynamic Language Du Jour isn't the answer. I only need tiny subset of dynamic language features for some limited application, providing all the speed that comes from that lack of flexibility. An abomination is only abominable in proportion to its uselessness!

(As a digression, I believe the previous sentence should be in the foreword of any book about C++.)

But Ruby's original interpreter, MRI, is written in C. Everything is possible in C + inline asm for a sufficiently painful definition of "possible," right? Isn't this all just a bunch of registers, some instructions, a heap and a stack? All we want to do (I think) is forward an invocation, exactly as-is, to an address we provide at runtime. We can refresh our memory of calling convention and patch up a generic call ourselves using pointer math. No?

function service_proxy(parent, service_desc):  
    proxy_method = pseudo_asm(self, args):
        args = stack_pop()                      // take args
        self = stack_pop()                      // take self
        for (child in parent.children)
            stack_push(args)                    // copy args
            stack_push(child)                   // self pointer will be child instead
            next_eip = (eip - self) + child     // uh, maybe something like this
            call next_eip                       // magic! :)
                                                // ...wait, who cleans up the stack again?
        end
    end
    for (method in service_desc.methods)
        method = proxy_method
    end
    return self
end

No.

Obviously this only looks easy because this pseudocode pretends it's O.G., using terms like eip, while it totally glosses things like the for each loop, and how it knows sizeof(args), and its recent One Direction collaboration. We could leave extra stack space -- say, enough for ten doubles. This doesn't look impossible.

Crap, the function might return a value. How much do we copy back?

Uhh.
Unnng.
Stone age computing hurt brain!

Let's JFGI.

Here's an actual i386/x86_64 implementation (condensed), courtesy of StackOverflow user Coltox, who looks like he signed up for the express purpose of ninja-ing this answer at all the people who said it couldn't be done. It can be done, and to answer the implicit question, it's very painful.

The code that follows splats one varargs call to one function invocation, the signature of which is available at compile time.

#include <limits.h>
#include <stdint.h>
#include <alloca.h>
#include <inttypes.h>
#include <string.h>

/* Currently we don't care about floating point arguments and
 * we assume that the standard calling conventions are used.
 *
 * The wrapper function has to start with VA_WRAP_PROLOGUE()
 * and the original function can be called by
 * VA_WRAP_CALL(function, ret), whereas the return value will
 * be stored in ret.  The caller has to provide ret
 * even if the original function was returning void.
 */

#define __VA_WRAP_CALL_FUNC __attribute__ ((noinline))
#define VA_WRAP_CALL_COMMON()                                        \
    uintptr_t va_wrap_this_bp,va_wrap_old_bp;                        \
    va_wrap_this_bp  = va_wrap_get_bp();                             \
    va_wrap_old_bp   = *(uintptr_t *) va_wrap_this_bp;               \
    va_wrap_this_bp += 2 * sizeof(uintptr_t);                        \
    size_t volatile va_wrap_size = va_wrap_old_bp - va_wrap_this_bp; \
    uintptr_t *va_wrap_stack = alloca(va_wrap_size);                 \
    memcpy((void *) va_wrap_stack,                                   \
        (void *)(va_wrap_this_bp), va_wrap_size);

#if ( __WORDSIZE == 64 )

/* System V AMD64 AB calling convention */
static inline uintptr_t __attribute__((always_inline)) va_wrap_get_bp()  
{
    uintptr_t ret;
    asm volatile ("mov %%rbp, %0":"=r"(ret));
    return ret;
}

#define VA_WRAP_PROLOGUE()           \
    uintptr_t va_wrap_ret;           \
    uintptr_t va_wrap_saved_args[7]; \
    asm volatile  (                  \
    "mov %%rsi,     (%%rax)\n\t"     \
    "mov %%rdi,  0x8(%%rax)\n\t"     \
    "mov %%rdx, 0x10(%%rax)\n\t"     \
    "mov %%rcx, 0x18(%%rax)\n\t"     \
    "mov %%r8,  0x20(%%rax)\n\t"     \
    "mov %%r9,  0x28(%%rax)\n\t"     \
    :                                \
    :"a"(va_wrap_saved_args)         \
    );

#define VA_WRAP_CALL(func, ret)            \
    VA_WRAP_CALL_COMMON();                 \
    va_wrap_saved_args[6] = (uintptr_t)va_wrap_stack;  \
    asm volatile (                         \
    "mov      (%%rax), %%rsi \n\t"         \
    "mov   0x8(%%rax), %%rdi \n\t"         \
    "mov  0x10(%%rax), %%rdx \n\t"         \
    "mov  0x18(%%rax), %%rcx \n\t"         \
    "mov  0x20(%%rax),  %%r8 \n\t"         \
    "mov  0x28(%%rax),  %%r9 \n\t"         \
    "mov           $0, %%rax \n\t"         \
    "call             *%%rbx \n\t"         \
    : "=a" (va_wrap_ret)                   \
    : "b" (func), "a" (va_wrap_saved_args) \
    :  "%rcx", "%rdx",                     \
      "%rsi", "%rdi", "%r8", "%r9",        \
      "%r10", "%r11", "%r12", "%r14",      \
      "%r15"                               \
    );                                     \
    ret = (typeof(ret)) va_wrap_ret;

#else
/* x86 stdcall */
static inline uintptr_t __attribute__((always_inline)) va_wrap_get_bp()  
{
    uintptr_t ret;
    asm volatile ("mov %%ebp, %0":"=a"(ret));
    return ret;
}

#define VA_WRAP_PROLOGUE() uintptr_t va_wrap_ret;

#define VA_WRAP_CALL(func, ret)        \
    VA_WRAP_CALL_COMMON();             \
    asm volatile (                     \
    "mov    %2, %%esp \n\t"            \
    "call  *%1        \n\t"            \
    : "=a"(va_wrap_ret)                \
    : "r" (func),                      \
      "r"(va_wrap_stack)               \
    : "%ebx", "%ecx", "%edx"   \
    );                                 \
    ret = (typeof(ret))va_wrap_ret;
#endif

Uhng. How it work?

int __VA_WRAP_CALL_FUNC proxy_printf(char *str, ...)  
{
    VA_WRAP_PROLOGUE();
    int ret;
    VA_WRAP_CALL(printf, ret);
    printf("printf returned with %d \n", ret);
    return ret;
}

Ladies and Gentlemen...printf.

(Bill: "I know, let's make a proxy to printf the return code from printf!" Carol: "Go home Bill, you're drunk.")

Still! Now we have a working solution! Let's hook it up!

...except that floats get passed in special float registers: don't pass any floats. Also, it doesn't do __cdecl or __fastcall. Also, it doesn't do ARM or PPC. Also, THIS IS MORE WORK. Remember our "manual" proxy?

static void lights_switch_impl(light_controller_t *self, bool on) {  
    self->room->lit = on;
    for (light_controller_t *child = self->child; child; child = child->next) {
        child->switch(child, on);
    }
}

Looks pretty good right now, doesn't it?