Post

Variadic Switch

Several years back I found an interesting question on Reddit. Essentially the author asks why there is no way to expand a pack into a sequence of case labels followed by a statement. It was illustrated with the following imaginary syntax:

1
2
3
4
5
6
7
8
9
10
11
template <class Visitor, class Variant, std::size_t ... Is>
auto visit_impl(Visitor visitor, Variant && variant, std::index_sequence<Is...>) {
    auto i = variant.index();
    switch (i) { (case Is: return visitor(get<Is>(variant));)... }
}

template <class Visitor, class Variant>
auto visit(Visitor visitor, Variant&& variant) {
    return visit_impl(visitor, variant, 
                      std::make_index_sequence<std::variant_size_v<Variant>>{});
}

This poses an interesting challenge - how close can we get to generically generating something that optimizes to equivalently good assembly as a hand-written switch would?

And perhaps even more interestingly: Can C++26 features help with this?

Jump Tables

Some programming languages provide a direct or indirect way of generating what’s called a jump table or branch table. While C++’s switch statements are not required to be turned into jump tables by the compiler, most compilers will do so if the number of cases is large enough and optimization is enabled.

For simplicity let’s not care about a generic visitor just yet, let’s instead always call an undefined template function h:

1
2
template <int>
int h();

Since it is not defined compilers cannot possibly inline calls to it. This should come in handy once we look at the generated assembly (and optimization passes) to see if compilers were able to understand our code.

Consider a simple switch such as:

1
2
3
4
5
6
7
8
9
10
int mersenne(unsigned index){
  switch (index) {
    case 0: return h<2>();
    case 1: return h<3>();
    case 2: return h<5>();
    case 3: return h<7>();
    case 4: return h<13>();
    default: return -1;
  }
}

With optimizations enabled, GCC turns this into the following x86 assembly:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
mersenne(unsigned int):
        cmp     edi, 4
        ja      .L2
        mov     edi, edi
        jmp     [QWORD PTR .L4[0+rdi*8]]
.L4:
        .quad   .L8
        .quad   .L7
        .quad   .L6
        .quad   .L5
        .quad   .L3
.L5:
        jmp     int h<7>()
.L3:
        jmp     int h<13>()
.L8:
        jmp     int h<2>()
.L7:
        jmp     int h<3>()
.L6:
        jmp     int h<5>()
.L2:
        mov     eax, -1
        ret

While it won’t help with solving the stated problem, we can approximately work backwards from the generated assembly by using GCC’s computed goto extension.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
template <auto> int h();

int mersenne(unsigned index){
  constexpr static void* jtable[] = 
  {                    //.L4:
    &&CASE_0,          //    .quad  .L8
    &&CASE_1,          //    .quad  .L7
    &&CASE_2,          //    .quad  .L6
    &&CASE_3,          //    .quad  .L5
    &&CASE_4,          //    .quad  .L3
  };

  if (index > 4)       // cmp  edi, 4
    goto DEFAULT;      // ja   .L2
  
  goto* jtable[index]; // jmp  [QWORD PTR .L4[0+rdi*8]]

CASE_0:                //.L8:
  return h<2>();       //    jmp  int h<2>()
CASE_1:                //.L7:
  return h<3>();       //    jmp  int h<3>()
CASE_2:                //.L6:
  return h<5>();       //    jmp  int h<5>()
CASE_3:                //.L5:
  return h<7>();       //    jmp  int h<7>()
CASE_4:                //.L3:
  return h<13>();      //    jmp  int h<13>()
DEFAULT:               //.L2:
  return -1;           //    mov  eax, -1
                       //    ret
}

Run on Compiler Explorer

This leads us to the first naive implementation strategy.

Dispatch Tables

To turn this back into valid C++, we need to create an array of function pointers instead of label addresses. Conveniently, grabbing a pointer to every specialization of h that we wish to handle is trivial.

This gets us what’s called a dispatch table. Our example from before can be rewritten as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
int mersenne(unsigned index){
  using F = int();
  constexpr static F* table[] = { 
    &h<2>,
    &h<3>,
    &h<5>,
    &h<7>,
    &h<13>,
    &h<17>
  };  

  if (index > 4) {
    // boundary check
    return -1;
  }
  return table[index]();
}

More generically, we can expand packs to generate the table.

1
2
3
4
5
6
7
8
9
10
11
template <int... Is>
int visit(unsigned index) {
  using F = int();
  constexpr static F* table[] = {&h<Is>...};  

  if (index >= (sizeof table / sizeof* table)) {
    // boundary check
    return -1;
  }
  return table[index]();
}

Run on Compiler Explorer

Note that while this still produces good assembly for this trivial example, this quickly degenerates and emits a call instead of a jump.

Generic visit

However, this can already be used to implement single-variant visit. Since visit is a great usage example, we will look at a simplified implementation for each strategy.

In the visit example implementations for all strategies, get is used for clarity.

However, in the linked CE examples you will see __unchecked_get instead, which (for libc++) is essentially the same as get minus boundary checks.

This is mostly done to make the emitted assembly easier to read.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
template <typename F, typename V>
decltype(auto) visit(F&& visitor, V&& variant) {
  constexpr static auto max_index = std::variant_size_v<std::remove_cvref_t<V>>;
  constexpr static auto table = []<std::size_t... Idx>(std::index_sequence<Idx...>) {
    return std::array{+[](F&& visitor, V&& variant) {
      return std::invoke(std::forward<F>(visitor), 
                         get<Idx>(std::forward<V>(variant)));
    }...};
  }(std::make_index_sequence<max_index>());

  const auto index = variant.index();
  if (index >= table.size()) {
    // boundary check
    std::unreachable();
  }
  return table[index](std::forward<F>(visitor), std::forward<V>(variant));
}

Run on Compiler Explorer

Macros

Surely the closest thing to a switch is.. a switch. Trusty old macros might not spark joy, but by cleverly combining them with C++ features we can try something else: Generating a lot of cases and then discarding all those whose index exceeds the amount of cases we want to handle.

To do this we first need some way to emit cases with an increasing index. A set of macros like the following can help:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#define STAMP4(Offset, Fnc)  \
    Fnc(Offset)              \
    Fnc((Offset) + 1)        \
    Fnc((Offset) + 2)        \
    Fnc((Offset) + 3)

#define STAMP16(Offset, Fnc)  \
    STAMP4(Offset, Fnc)       \
    STAMP4((Offset) + 4, Fnc) \
    STAMP4((Offset) + 8, Fnc) \
    STAMP4((Offset) + 12, Fnc)

#define STAMP64(Offset, Fnc)    \
    STAMP16(Offset, Fnc)        \
    STAMP16((Offset) + 16, Fnc) \
    STAMP16((Offset) + 32, Fnc) \
    STAMP16((Offset) + 48, Fnc)

#define STAMP256(Offset, Fnc)    \
    STAMP64(Offset, Fnc)         \
    STAMP64((Offset) + 64, Fnc)  \
    STAMP64((Offset) + 128, Fnc) \
    STAMP64((Offset) + 192, Fnc)

#define STAMP(Count) STAMP##Count

With the STAMP macro in place, we can now define a macro to generate a case.

1
2
3
4
5
6
7
#define GENERATE_CASE(Idx)                                             \
  case (Idx):                                                          \
    if constexpr ((Idx) < max_index) {                                 \
      return std::invoke(std::forward<F>(visitor),                     \
                         get<Idx>(std::forward<Vs>(variant)));         \
    }                                                                  \
    std::unreachable();

To reduce the amount of branches that need to be discarded as much as possible, we can generate multiple explicit specializations of a class template that handle variants up to a specific size limit each.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#define GENERATE_STRATEGY(Idx, Stamper)                                 \
  template <>                                                           \
  struct VisitStrategy<Idx> {                                           \
    template <typename F, typename V>                                   \
    static constexpr decltype(auto) visit(F&& visitor, V&& variant) {   \
      switch (variant.index()) {                                        \
        Stamper(0, GENERATE_CASE);                                      \
        default:                                                        \
          std::unreachable();                                           \
      }                                                                 \
    }                                                                   \
  }

GENERATE_STRATEGY(1, STAMP(4));    // 4^1 potential states
GENERATE_STRATEGY(2, STAMP(16));   // 4^2 potential states
GENERATE_STRATEGY(3, STAMP(64));   // 4^3 potential states
GENERATE_STRATEGY(4, STAMP(256));  // 4^4 potential states

At this point we only need to select the appropriate specialization of VisitStrategy. Note that this can only handle variants with up to 256 alternatives - if you need to support more than that, you have to fall back to another strategy (ie. dispatch table).

1
2
3
4
5
6
7
8
9
10
11
template <typename F, typename V>
constexpr decltype(auto) visit(F&& visitor, V&& variant) {
  constexpr static auto max_index = std::variant_size_v<std::remove_cvref_t<V>>;
  using visit_helper              = VisitStrategy<max_index <= 4     ? 1
                                                  : max_index <= 16  ? 2
                                                  : max_index <= 64  ? 3
                                                                     : 4;

  return visit_helper::template visit(std::forward<F>(visitor),
                                      std::forward<Vs>(variant));
}

Recursive switch

This is nice, however you have been promised C++ metaprogramming magic, not boring old macros. Let’s explore another approach.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
template <int Offset, int Limit>
[[clang::always_inline]] inline int visit(unsigned index) {
  switch (index - Offset) {
    case 0:
      if constexpr (Offset < Limit) {
        return h<Offset>();
      }
    default:
      if constexpr (Offset + 1 < Limit) {
        return visit<Offset + 1, Limit>(index);
      }
  }

  std::unreachable();
}

While the inline keyword’s more interesting use is to collapse multiple definitions of functions or variables in different translation units into one entity with one address (dcl.inline/6) and therefore sidestep what would otherwise be ODR violations, some compilers actually honor inline as an ignorable optimization hint (dcl.inline/2). For example Clang raises the inlining threshold slightly.

Unfortunately it is not possible to force inlining in a portable way. However, most compilers have special attributes for this very purpose.

CompilerAttribute
GCC[[gnu::always_inline]]
Clang[[clang::always_inline]] or [[gnu::always_inline]]
MSVC[[msvc::forceinline]]

We can use Compiler Explorer’s clang Opt Pipeline Viewer feature to verify all calls to visit were inlined. After the SimplifyCFG optimization pass the chained comparisons are now combined into a switch. If we do not force inlining, switches may already be generated during the earlier InlinerPass pass and increase the calculated inlining cost beyond the threshold, resulting in multiple jump tables.

1
2
3
4
5
6
7
8
9
10
; int visit<0ul, 3ul>(unsigned long)
define weak_odr dso_local noundef i32 @_Z5visitILm0ELm3EEim(i64 noundef %index) local_unnamed_addr #0 comdat {
entry:
  switch i64 %index, label %sw.epilog.i [
    i64 0, label %sw.bb
    i64 1, label %sw.bb.i
    i64 2, label %_Z5visitILm2ELm3EEim.exit
  ]
  ; ...
}

However if we look at the generated assembly, we will notice the lack of a jump table:

1
2
3
4
5
6
int visit<0ul, 3ul>(unsigned long):
        cmp     rdi, 2
        je      int h<2ul>()
        cmp     rdi, 1
        jne     int h<0ul>()
        jmp     int h<1ul>()

Turns out clang decided it’s cheaper to to not generate a jump table for so few branches with x86 as target assembler. This might happen, but at least for x86 we get a jump table with 4 or more cases. This can be verified by looking at the X86 DAG->DAG Instruction Selection pass.

As before after the SimplifyCFG pass it will have combined all inlined comparisons into one switch.

1
2
3
4
5
6
7
8
9
10
11
; int visit<0ul, 4ul>(unsigned long)
define weak_odr dso_local noundef i32 @_Z5visitILm0ELm4EEim(i64 noundef %index) local_unnamed_addr #0 comdat {
  entry:
    switch i64 %index, label %sw.epilog.i [
      i64 0, label %sw.bb
      i64 1, label %sw.bb.i
      i64 2, label %sw.bb.i7
      i64 3, label %_Z5visitILm0ELm4EEim.exit
    ]
  ; ...
}

This time we can see this lovely note in the result of the X86 DAG->DAG Instruction Selection pass:

1
2
3
4
# Machine code for function 
# int visit<0ul, 4ul>(unsigned long): IsSSA, TracksLiveness
Jump Tables:
%jump-table.0: %bb.1 %bb.2 %bb.3 %bb.5

Yay, a jump table! Let’s verify the generated assembly

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
int visit<0ul, 4ul>(unsigned long): 
        lea     rax, [rip + .LJTI0_0]
        movsxd  rcx, dword ptr [rax + 4*rdi]
        add     rcx, rax
        jmp     rcx
.LBB0_1:                           // label %sw.bb
        jmp     int handler<0ul>() // TAILCALL
.LBB0_3:                           // label %sw.bb.i7
        jmp     int handler<2ul>() // TAILCALL
.LBB0_4:                           // label %_Z5visitILm0ELm4EEim.exit
        jmp     int handler<3ul>() // TAILCALL
.LBB0_2:                           // label %sw.bb.i
        jmp     int handler<1ul>() // TAILCALL
.LJTI0_0:
        .long   .LBB0_1-.LJTI0_0
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_4-.LJTI0_0

.LJTI0_0 there it is. Magnificent.

Okay, so is this approach sufficient? Unfortunately not. Currently only clang is able to see through this and generate one big jump table.

Fold Tricks

One commenter in the reddit thread I mentioned in the introduction of this blog post proposes the following rather arcane code:

1
2
3
4
5
6
template <int... Is>
int visit(unsigned index) {
  int retval;
  std::initializer_list<int>({(index == Is ? (retval = h<Is>()), 0 : 0)...});
  return retval;
}

Clang is able to see through this, unfortunately GCC and MSVC cannot (Run on Compiler Explorer). To explain why this is the case we need to dig into what GCC does under the hood.

First let’s rewrite the initializer list part to a fold over ,. Clang can still see through this, GCC still cannot.

1
2
3
4
5
6
template <int... Is>
int visit(unsigned index) {
  int retval;
  ((index == Is ? (retval = h<Is>()),0 : 0), ...);
  return retval;
}

Assuming Is is the index sequence [0, 5] this expands to the following code (cppinsights):

1
2
3
4
5
6
7
8
9
10
11
template<>
int visit<0, 1, 2, 3, 4, 5>(unsigned index) {
  int retval;
  (index == 0 ? (retval = h<0>()) , 0 : 0),
    ((index == 1 ? (retval = h<1>()) , 0 : 0), 
      ((index == 2 ? (retval = h<2>()) , 0 : 0) , 
        ((index == 3 ? (retval = h<3>()) , 0 : 0) , 
          ((index == 4 ? (retval = h<4>()) , 0 : 0) , 
            (index == 5 ? (retval = h<5>()) , 0 : 0)))));
  return static_cast<int &&>(retval);
}

We can rewrite this to be a little easier to read:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
int visit(unsigned index) {
  int retval;
  if (index == 0) { // case 0
    retval = h<0>();
  }
  if (index == 1) { // case 1
    retval = h<1>();
  }
  if (index == 2) { // case 2
    retval = h<2>();
  }
  if (index == 3) { // case 3
    retval = h<3>();
  }
  if (index == 4) { // case 4
    retval = h<4>();
  }
  if (index == 5) { // case 5
    retval = h<5>();
  }
  return retval;
}

Let’s look at the first two “cases” just before GCC’s iftoswitch optimization pass. For brevity the following snippet has some pass details removed and identifiers changed to match the code snippet. Additionally there’s a little white lie hidden here - it does not actually write directly to retval just yet, for now it would use a temporary.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
//   basic block 2, loop depth 0, maybe hot
//    pred:       ENTRY (FALLTHRU,EXECUTABLE)
  if (index == 0)
    goto <bb 3>;
  else
    goto <bb 4>;
//    succ:       3 (TRUE_VALUE,EXECUTABLE)
//                4 (FALSE_VALUE,EXECUTABLE)

//   basic block 3, loop depth 0, maybe hot
//    pred:       2 (TRUE_VALUE,EXECUTABLE)
  retval = h<0>();
//    succ:       4 (FALLTHRU,EXECUTABLE)

//   basic block 4, loop depth 0, maybe hot
//    pred:       2 (FALSE_VALUE,EXECUTABLE)
//                3 (FALLTHRU,EXECUTABLE)
  if (index == 1)
    goto <bb 5>;
  else
    goto <bb 6>;
//    succ:       5 (TRUE_VALUE,EXECUTABLE)
//                6 (FALSE_VALUE,EXECUTABLE)

//   basic block 5, loop depth 0, maybe hot
//    pred:       4 (TRUE_VALUE,EXECUTABLE)
  retval = h<1>();

So unfortunately GCC did not infer that only one of the “case” branches will be taken at this point and decides against turning it into a switch. Let’s help GCC out and turn this into

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
int visit(unsigned index) {
  int retval;
  if (index == 0) {      // case 0
    retval = h<0>();
  }
  else if (index == 1) { // case 1
    retval = h<1>();
  }
  else if (index == 2) { // case 2
    retval = h<2>();
  }
  else if (index == 3) { // case 3
    retval = h<3>();
  }
  else if (index == 4) { // case 4
    retval = h<4>();
  }
  else if (index == 5) { // case 5
    retval = h<5>();
  }
  return retval;
}

instead. We can manually confirm that this works. Great - so how do we do this generically?

Short-circuiting folds

We already had a pack of indices before, so let’s use a fold expression again. What we’re looking for is an operator capable of short circuiting.

C++ has two such operators - logical and (expr.log.and/1) and logical or (expr.log.or/1). So let’s fold over or instead.

1
2
3
4
5
6
template <int... Is>
int visit(unsigned index) {
  int value;
  (void)((index == Is ? value = h<Is>(), true : false) || ...);
  return value;
}

According to cppinsights this desugars to:

1
2
3
4
5
6
7
8
9
10
11
12
13
template<>
int visit<0, 1, 2, 3, 4, 5>(unsigned index)
{
  int value;
  static_cast<void>(
    (index == 0 ? (value = h<0>()) , true : false) || 
      ((index == 1 ? (value = h<1>()) , true : false) || 
        ((index == 2 ? (value = h<2>()) , true : false) || 
          ((index == 3 ? (value = h<3>()) , true : false) || 
            ((index == 4 ? (value = h<4>()) , true : false) || 
              (index == 5 ? (value = h<5>()) , true : false))))));
  return static_cast<int &&>(value);
}

As Compiler Explorer can confirm for us: this will convince GCC to generate a jump table. Awesome.

Generic visit

Unfortunately making this generic is a little more difficult.

Since we need to prepare a variable of the visitor’s return type, we need to find the visitor’s return type. Fortunately visitors for all alternatives are required to have the same return type, so we can simply check for the first alternative.

1
2
3
template <typename F, typename V>
using visit_result_t = std::invoke_result_t<F, std::variant_alternative_t<0, 
                                                  std::remove_reference_t<V>>>;

With this we can set up visit.

1
2
3
4
5
6
7
8
template<typename F, typename V>
constexpr auto visit(F&& visitor, V&& variant){
  using return_type = visit_result_t<F, V>;
  constexpr auto size = std::variant_size_v<std::remove_cvref_t<V>>;
  return visit_impl<return_type>(std::make_index_sequence<size>(), 
                                 std::forward<F>(visitor), 
                                 std::forward<V>(variant));
}

Visitors returning void

Since we cannot have variables of type void, we need to special case for visitors that return void. This is rather easy.

1
2
3
4
5
6
7
8
9
10
11
12
13
template <typename R, typename F, typename V, std::size_t... Idx>
  requires (std::same_as<R, void>)
constexpr void visit_impl(std::index_sequence<Idx...>, F&& fnc, V&& variant) {
    auto index = variant.index();
    if(((index == Idx ? std::forward<F>(fnc)(get<Idx>(std::forward<V>(variant))), true 
                      : false) || ...)){
      // visitor found
      return;
    }

    // no visitor found
    std::unreachable();
}

Default-constructibility

Next, let’s make sure we do not introduce a default-constructibility requirement on the return type of the visitor. The idiomatic way to do so is by wrapping it in a std::optional.

However, the fold expression already tells us whether a visitor was found (and therefore if the optional is engaged), we don’t really need optional to begin with.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
template <typename R, typename F, typename V, std::size_t... Idx>
constexpr auto visit_impl(std::index_sequence<Idx...>, F&& fnc, V&& variant) {
  auto index = variant.index();
  union Container {
    ~Container() = default;
    ~Container() requires (not std::is_trivially_destructible_v<R>) {}

    char dummy{};
    R value;
  } ret{};

  if(((index == Idx ? std::construct_at(&ret.value, 
                          std::forward<F>(fnc)(get<Idx>(std::forward<V>(variant)))), true 
                    : false) || ...)){
    // found a visitor -> value is the active member
    return ret.value;
  }

  // no visitor found
  std::unreachable();
}

Move-only return types

Finally, we need to make sure this still works if the visitor returned an object of move-only type.

This can be done by explicitly move-constructing a temporary R object directly in a return statement if we cannot copy-construct.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
template <typename R, typename F, typename V, std::size_t... Idx>
constexpr auto visit_impl(std::index_sequence<Idx...>, F&& fnc, V&& variant) {
  auto index = variant.index();
  union Container {
    ~Container() = default;
    ~Container() requires (not std::is_trivially_destructible_v<R>) {}

    char dummy{};
    R value;
  } ret{};

  if(((index == Idx ? std::construct_at(&ret.value, 
                          std::forward<F>(fnc)(get<Idx>(std::forward<V>(variant)))), true 
                    : false) || ...)){
    // found a visitor -> value is the active member
-    return ret.value;
+    if constexpr (!std::is_copy_constructible_v<R>){
+      return R(std::move(ret.value));  
+    } else {
+      return ret.value;
+    }
  }

  // no visitor found
  std::unreachable();
}

Run the full example on Compiler Explorer.

Expansion statements

So, can we do any better in C++26? Are there any C++26 features that could help with this?

While it’s not yet approved for inclusion in C++26 at the time of writing, one interesting candidate is expansion statements from P1306. I’ve mentioned expansion statements in C++26 Expansion Tricks before, but originally missed its usefulness for this particular problem.

In the spirit of Duff’s device, the obvious approach is to interleave switch and expansion statements.

1
2
3
4
5
6
7
8
9
int visit(int x) {
    switch (x) {
        template for (constexpr auto I : {0, 1, 2, 3, 4}) {
            case I:
                return h<I>();
        }
    }
    return -1;
}

While this compiles with the clang-p2996 fork at the time of writing, this is unfortunately not intended to work.

To explain why, we need to dig into the proposal a little (Thanks to Dan Katz for clarifying this!).

[stmt.expand]/2 from P1306 states (emphasis mine)

The contained statement of an expansion-statement is a control-flow-limited statement ([stmt.label]).

[stmt.label]/3.1 applies to our example:

a case or default label appearing within S shall be associated with a switch statement (stmt.switch) within S

Unfortunately the switch statement in our example is outside of the contained statement S of the expansion statement. We are therefore not allowed to do this.

Optimizer to the rescue

Bummer. However, as we’ve already proven with the fold tricks, clang (and gcc for that matter) are perfectly capable of optimizing multiple branches into a jump table for us.

With the fold tricks from before we needed to prepare a to-be-returned object ahead of time. This is because we cannot return out of a fold expression directly. However, with expansion statements this is not an issue.

This makes it possible to simplify to the following code:

1
2
3
4
5
6
7
8
int visit(int x) {
    template for (constexpr auto I : {0, 1, 2, 3, 4}) {
        if (x == I) {
            return h<I>();
        }
    }
    return -1;
}

We can manually verify with Compiler Explorer that Clang optimizes this as intended.

At the time of writing “Opt Viewer” for clang-p2996 is broken due to a missing Compiler Explorer-specific patch.

To illustrate how this gets optimized, we can instead look at what the expansion statement above would ultimately expand to:

1
2
3
4
5
6
7
8
9
int visit(int x) {
   if (x == 0) { return h<0>(); }
   if (x == 1) { return h<1>(); }
   if (x == 2) { return h<2>(); }
   if (x == 3) { return h<3>(); }
   if (x == 4) { return h<4>(); }
   if (x == 5) { return h<5>(); }
   return -1;
}

View on Compiler Explorer

Generic visit

Implementing a generic single-variant visit is now rather trivial:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
template <typename F, typename V>
constexpr decltype(auto) visit(F&& fnc, V&& variant) {
  static constexpr std::size_t size = std::variant_size_v<std::remove_cvref_t<V>>;
  auto const index = variant.index();
  template for (constexpr auto Idx : std::views::iota{0uz, size}) {
    if (index == Idx) {
      return std::invoke(std::forward<F>(fnc),
                         get<Idx>(std::forward<V>(variant)));
    }
  }

  // replace with proper error handling
  std::unreachable();
}

Run on Compiler Explorer

For a real world example check out rsl/variant.

This post is licensed under CC BY 4.0 by the author.