On `union` in C

#c #learning #union #lowlevel

This is a response I wrote several months ago on /r/C_Programming to the question "Can someone explain unions?" No one ever read it, as far as I can tell, because /r/C_Programming is basically abandoned and most people ignore that kind of "I didn't know what I was getting into when I signed up for a college course that uses C!" cry for help.

I thought the question was pretty interesting, though, because in an age where higher-level and relatively narrow tools for polymorphism are readily available (Rust's enum, Haskell's sum types, and object-oriented subclassing in Java, C# and C++), it's not always immediately obvious how unions are (ab)used in C.

There are two common use cases for unions. One is storing a value that might be one of several different types, alongside a tag that records which one it currently is; this is called a tagged union. For a common example, consider a type Number which can represent either an int or a float:

#include <stdio.h>

enum NumberKind { FLOAT, INT };
struct Number {
  enum NumberKind kind;
  union {
    int i;
    float f;
  };
};

void output_number(struct Number * n) {
  switch (n->kind) {
  case INT:
    printf("The integer %d\n", n->i);
     break;
  case FLOAT:
    printf("The float %f\n", n->f);
    break;
  }
}

int main() {
  struct Number three = { .kind = INT, { .i = 3 } };
  struct Number two_point_five = { .kind = FLOAT, { .f = 2.5 } };

  output_number(&three);
  output_number(&two_point_five);

  return 0;
}

The union section of struct Number will hold either an int i or a float f, and the enum NumberKind kind field tells the programmer which it is. The implementations of many interpreted languages use tagged unions (often with fascinating optimizations, but that's a story for another time) to represent values, which is why you don't have to write types like int, float, or struct Number in JavaScript.
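To make the "either an int or a float" point concrete, here is a small sketch that checks two properties C guarantees for unions: every member starts at the union's own address, and the union is at least as large as its largest member. The names IntOrFloat, members_share_storage, and union_fits_largest_member are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for the anonymous union inside `struct Number`.
union IntOrFloat {
  int i;
  float f;
};

// true iff both members begin at the same address as the union itself
bool members_share_storage(void) {
  union IntOrFloat u;
  return (void *)&u.i == (void *)&u && (void *)&u.f == (void *)&u;
}

// true iff the union is big enough to hold its largest member
bool union_fits_largest_member(void) {
  size_t largest = sizeof(int) > sizeof(float) ? sizeof(int) : sizeof(float);
  return sizeof(union IntOrFloat) >= largest;
}
```

Both functions should return true on any conforming compiler, which is the whole trick: writing i and then reading f looks at the very same bytes.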

As with most things in C, there's a lot of room for mistakes; there's nothing stopping me from reading n->i regardless of what n->kind is, and believe me when I say that I and every other C programmer have spent more hours than we'd like debugging exactly that.
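One common defense, sketched here with hypothetical helper names (number_as_int and number_as_float are mine, not from any library), is to funnel every read through an accessor that checks the tag first. The type definitions are repeated from the example above so the sketch compiles on its own.

```c
#include <assert.h>

enum NumberKind { FLOAT, INT };
struct Number {
  enum NumberKind kind;
  union {
    int i;
    float f;
  };
};

// Hypothetical helper: read the int member, aborting if the tag says otherwise.
// Note that assert() compiles to nothing when NDEBUG is defined.
int number_as_int(const struct Number * n) {
  assert(n->kind == INT);
  return n->i;
}

// Hypothetical helper: read the float member, aborting if the tag says otherwise.
float number_as_float(const struct Number * n) {
  assert(n->kind == FLOAT);
  return n->f;
}
```

None of this is enforced by the language, of course; the raw members are still reachable by anyone who wants to read them directly.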

That ability, though, is vital to the other use of unions, type punning. Wikipedia has a great article on type punning, which it defines as "any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language."

That's kind of a mouthful, but the point is that, at a certain point, you and the type system are going to disagree, and type punning is the way you cheat. Rust has it in the standard library, but in C we're limited to either pointer-casts or unions.

Consider code for pointer-tagging.

#include <stdint.h>
#include <stdbool.h>
#include <assert.h>
#include <stdlib.h>

union TaggedPointer {
  void * p;
  uintptr_t i;
};

// set the least significant bit of `ptr`
void * tag_a_pointer(void * ptr) {
  union TaggedPointer tagged = { .p = ptr };
  tagged.i |= 1;
  return tagged.p;
}

// true iff the least significant bit of `ptr` is set
bool has_low_bit_set(void * ptr) {
  union TaggedPointer tagged = { .p = ptr };
  return (tagged.i & 1) == 1;
}

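int main() {
  void * some_pointer = malloc(16);
  if (some_pointer == NULL) return 1; // allocation failed

  assert(!has_low_bit_set(some_pointer));

  void * tagged = tag_a_pointer(some_pointer);
  assert(has_low_bit_set(tagged));

  free(some_pointer); // free the original, untagged pointer
  return 0;
}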

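If you're not familiar, the idea behind pointer tagging is that the low few bits of a pointer go unused because of alignment requirements (on typical platforms, malloc returns addresses that are multiples of at least 8, so those bits are always zero), and so they can be used the same way as the tag in a tagged union.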
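A tagged pointer can't be dereferenced until the tag bit is cleared again, so real code needs the inverse operation too. Here is a rough sketch of it; untag_a_pointer is my own name, and the code assumes, as the example above does, that the original pointer's lowest bit was zero and that void * and uintptr_t share a representation (true on mainstream platforms, though the C standard doesn't promise it).

```c
#include <stdint.h>

union TaggedPointer {
  void * p;
  uintptr_t i;
};

// set the least significant bit of `ptr`
void * tag_a_pointer(void * ptr) {
  union TaggedPointer tagged = { .p = ptr };
  tagged.i |= 1;
  return tagged.p;
}

// clear the least significant bit of `ptr`, recovering the original pointer
// (hypothetical helper, assuming the untagged pointer had that bit clear)
void * untag_a_pointer(void * ptr) {
  union TaggedPointer tagged = { .p = ptr };
  tagged.i &= ~(uintptr_t)1;
  return tagged.p;
}
```

Round-tripping a pointer through tag_a_pointer and then untag_a_pointer should hand back the address you started with.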

To do that, though, you need to be able to operate on the bits of the pointer as if they were an integer. Because TaggedPointer.i and TaggedPointer.p occupy the same block of memory, altering one (as in tag_a_pointer) changes the other. Many people think this looks cleaner than a pointer-cast, which might look like:

void * tag_a_pointer_but_with_pointer_casts(void * ptr) {
  uintptr_t i = *((uintptr_t *)(&ptr));
  i |= 1;
  return *((void **)(&i));
}
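For comparison, a third spelling skips the trip through memory and converts the pointer's value with an ordinary cast. This is a sketch rather than anything from the original answer: tag_a_pointer_with_value_cast is a hypothetical name, uintptr_t is an optional type in the C standard, and what you get from converting a modified integer back to a pointer is implementation-defined. Its advantage is that it never reads a void * object through a uintptr_t lvalue, which the pointer-cast version above does and which, as far as I can tell, runs afoul of C's strict aliasing rules.

```c
#include <stdint.h>

// Hypothetical alternative: tag by converting the pointer *value* to an integer.
void * tag_a_pointer_with_value_cast(void * ptr) {
  uintptr_t i = (uintptr_t)ptr; // pointer-to-integer conversion
  i |= 1;                       // set the least significant bit
  return (void *)i;             // integer-to-pointer conversion
}
```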

Aside: The reason my two examples in this post are similar is not because unions are limited to use in tagging for interpreted languages, but just because my hobbyist work is limited to that field. Both patterns pervade the language. I first encountered tagged unions while digging through the C source for the Xen hypervisor (a pretty brutal first exposure to C, but I was just an intern, so it didn't really matter). The first example I ever saw of type punning was FastInverseSquareRoot, which is an adventure in low-level programming unto itself.
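For a small taste of the float-to-integer punning that trick depends on, here is a sketch that reads a float's bit pattern through a union. FloatBits and float_bits are names I made up, and the code assumes a 32-bit IEEE 754 float, which is nearly universal but not required by the C standard. This is only the punning step, not the inverse square root algorithm itself.

```c
#include <stdint.h>

// Assumption: float is 32 bits wide; refuse to compile otherwise (C11).
_Static_assert(sizeof(float) == sizeof(uint32_t), "expected a 32-bit float");

union FloatBits {
  float f;
  uint32_t u;
};

// reinterpret the bytes of `x` as an unsigned 32-bit integer
uint32_t float_bits(float x) {
  union FloatBits bits = { .f = x };
  return bits.u;
}
```

On an IEEE 754 platform, float_bits(1.0f) is 0x3f800000: a sign bit of 0, a biased exponent of 127, and an all-zero mantissa.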

Top comments (3)

wiz • Dec 3 '18

I used to question the use of unions too, and my professor gave me this scenario: when we have a network port and we don't know the type of the incoming data, we use a union.

typedef union _Packet {
  int iData;
  double dData;
  char cData;
}Packet;

Phoebe Goldman • Dec 3 '18

This snippet isn't complete on its own --- you need some way to tell whether a Packet is going to be an int, double, or char. Consider:

#include <stdio.h>

typedef union Packet {
    int iData;
    double dData;
    char cData;
} Packet;

extern Packet read_next_packet();

int main() {
    Packet received = read_next_packet();
    printf("`received.iData` is %d\n", received.iData);
    printf("`received.dData` is %f\n", received.dData);
    printf("`received.cData` is %c\n", received.cData);
}

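All three printf statements will run, but at most one of them reads the member that was actually stored. The other two reinterpret whatever bytes happen to be there: garbage at best, and undefined behavior if those bytes form a trap representation for the type being read. What happens if received is a char and I try to access received.dData? What value do the other 7 bytes of the 8-byte double take?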
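One way to make the snippet complete is to ship a tag alongside the union so the receiver knows which member is live. This is only a sketch: PacketKind, TaggedPacket, and describe_packet are hypothetical names of mine, not part of any networking API.

```c
#include <stdio.h>

typedef enum PacketKind { PACKET_INT, PACKET_DOUBLE, PACKET_CHAR } PacketKind;

typedef struct TaggedPacket {
    PacketKind kind; // says which union member was stored
    union {
        int iData;
        double dData;
        char cData;
    };
} TaggedPacket;

// Writes a description of the live member into `buf`.
// Returns snprintf's result, or -1 for an unknown tag.
int describe_packet(const TaggedPacket * p, char * buf, size_t len) {
    switch (p->kind) {
    case PACKET_INT:    return snprintf(buf, len, "int %d", p->iData);
    case PACKET_DOUBLE: return snprintf(buf, len, "double %f", p->dData);
    case PACKET_CHAR:   return snprintf(buf, len, "char %c", p->cData);
    }
    return -1;
}
```

Now only the member named by kind is ever read, so none of the reinterpretation problems above come up.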

wiz • Dec 4 '18

Yeah, you are absolutely right. The snippet is incomplete, and it would be buggy as written; there has to be some supporting code to handle those issues, though I don't know what that looks like yet. I just wanted to show a use of the union.
Say you have 10 different types of data coming in, and you know that only one of them will be stored at any given time. There is no point in using a struct, since it would consume a lot more memory. The union was introduced in C for cases like this, because memory was scarce back then.
