Tech 101: Making Clang sufficiently dumb

The title might be baffling to some of you; if it is, I advise you read this nice piece which explains the Sufficiently Smart Compiler problem and why it's an issue far better than I ever could. If by the end of this, you're not convinced of why this is a problem (and especially a problem for a language like C), then feel free to ignore this whole article and go about your merry way. However, if you, like me, find this concerning, here is a set of instructions of how you can avoid this problem while using Clang. I will only discuss C here; if you care about C++, write your own blog post.

As an aside: if your compiler of choice is GCC, the approach is a little bit different. I've written some POSIX shell to deal with this problem, so you can just use those. I would appreciate feedback on it though.

The basics

Before I can fully launch into how you can avoid being sufficiently smart with Clang, it's worth discussing how compilers work. We're going to discuss this in three steps: firstly, compilers generally; secondly, C compilers (at least conceptually); and lastly, the LLVM compiler stack specificlaly. If you already understand how a compiler, or a C compiler, typically works, you shouldn't find this too hard; if not, you're in for a fun ride.

Compiler basics

A compiler is nothing more than an automatic translation program, which turns an input in one language (typically called the source language) into an output in a different one (typically called the target language). In the case of C specifically:

The input is source code (in the form of one or more files)
The source language is C (obviously)
The output is a library or executable (depending on what we want)
The target language is usually assembly language for whatever architecture we're currently on (x86_64, ARM, or whatever)

When we talk about languages, we generally care about two things:

What the language looks like, and the rules of its formation (its syntax); and
What the language means (its semantics)

We can also, by extension, talk about syntax and semantics for a given input. Any given input can be either syntactically valid in some language (meaning that it follows the rules of the way the language looks), or syntactically invalid in any language (meaning that it's gibberish). Given that an input is syntactically valid in some language, that input also has semantics in that language (i.e. its meaning). Now, whether this is the meaning you intended or not is not given, but the fact is, syntactically invalid input cannot mean anything, correct or not.

Now, when we do a translation of some input, provided that the input has any meaning at all, we want the translation to have the same meaning. To use our fancy vocabulary above, given syntactically valid input, a compiler must produce syntactically valid output with the same semantics as the input; this fundamentally means that compilation is a syntactic task. At the same time, however, a compiler must ensure several things as part of its operation:

That the input is syntactically valid in the source language (we can't translate gibberish)
That the output is syntactically valid in the target language (non-gibberish never means the same as gibberish)
The semantics of the output must be the same as the semantics of the input (or we're lying to our users)

It'd also be nice if we could also ensure the following things:

That the input has the semantics we intended
That the output is as efficient as possible

Now obviously, these two are difficult if not impossible in the general case: the first one would involve our compiler being literally able to read our minds (at least), while the second is a major reason why the Sufficiently Smart Compiler problem exists in the first place. In theory, these two are not strictly necessary for a compiler to do its work, but in some limited cases, we can still get some of those benefits.

Now that we've gone over what compilers are meant to do, let's look over how they go about doing it, in a general sense. Traditionally, the work of a compiler is divided into five 'phases' of operation:

Lexical analysis (or lexing) - taking input in (what we expect to be) the source language, and turning it into a stream of syntax pieces (or tokens); if we can't, reject and abort
Syntactic analysis (or parsing) - taking the stream of tokens and turning it into a tree representing its semantics (called an abstract syntax tree or AST); if we can't, reject and abort
Semantic analysis - taking an AST and checking that its semantics are remotely what we intended or make any reasonable sense; if we find something that doesn't, reject and abort
Optimization - taking the AST and turning it into an AST with the same semantics, but (ideally) one that can be transformed into more efficient code
Code generation - taking the AST and emitting syntactically valid output in the target language with the same semantics

Now, we can clearly see that the first two phases are about syntax, or rather, making sure that what we've been given is indeed a syntactically valid input in the source language. We need to do this because humans are fallible, and code is extremely specific in its meaning. Once we get past step 3 (the only place where we directly consider semantics), we are no longer dealing with the source language at all, because the AST represents the semantics that we're interested in, in a way that's not dependent on any specific source language. Code generation doesn't really need the thorough checking we do on the input precisely for this reason - we essentially have 'pure semantics', with no possibility of confusion or bad inputs.

Typically, the first two phases are called the 'front end', and the last two are called the 'back end'. Semantic analysis is a bit of an odd one, and can potentially belong in either camp. This separation is not coincidental; by keeping these two separate, we gain multiple benefits: the back end doesn't have to be concerned about the source language; the front end doesn't have to care about the target language; we can port any source language we have a front end for to a new target language by just changing the back end; we can compile many different source languages into the same target language by just changing the front end; and so on. Thus, to nobody's great surprise, most compilers these days take exactly this approach.

Compiling C

The C language in particular adds a few extra moments to the above description, although still being broadly the same. More exactly, C has three additional concepts: preprocessing, separate compilation and translation units, and linking, all of which we will describe (again, in a general way) next. Of these, preprocessing is the simplest: it is essentially a simple copy-paste operation performed on C source files before any of the code is fed to the compiler proper.

With respect to separate compilation, C allows you to divide up your program into translation units, which you can think of as C source files. Each of these is compiled separately, and only combined together at the end. This doesn't really change the five-phase description above; it just means that it runs many times over your program, not all in one go. It also means each translation unit undergoes each process separately - in fact, this is one reason why header files are necessary, as without them, we wouldn't easily have enough information to make many of the checks required. It also limits some of the work of the optimization pass due to having only per-unit information (although in practice, all modern compilers have ways to avoid this problem).

Now, once we've compiled all the translation units (separately), we need some way to combine them into a single output (the exact nature of which might vary depending on what we want). This step, called linking, has several responsibilities:

If a translation unit calls a function in another translation unit, these calls must be routed appropriately
If we use any libraries, we have to make sure that they're either included into our output, or set up some kind of runtime resolution
Depending on what output we want, set up the appropriate scaffolding (such as an entry point for an executable)

It's worth noting that, at this point, all the code has already been compiled - when we link, no more compilation takes place. In particular, it means that the linker (the program which does all the above) isn't aware of anything to do with the source language anymore; it operates entirely on the target language. Now, modern compilers don't really work this way exactly, but once again, keeping that mental model can help understand much of what's to come (and a bunch of other things in C besides, which are written with exactly this model in mind).

Thus, in short, your C codebase undergoes the following steps:

For each translation unit, apply whatever copy-paste operations required by their preprocessor directives
Compile each translation unit, using the five-phase model described previously
Link everything together

How Clang and LLVM do it

Now that all that is over, we can talk about how Clang and LLVM do this. Essentially, LLVM makes a slight difference to the five-phase model above, and adjusts the C compiler model a little to accomodate this. After C code passes through its front end, Clang emit LLVM bytecode, which can be thought of as assembly language for some non-existent 'LLVM machine'. No optimization pass is done here, and the actual target language is not generated: in a sense, LLVM bytecode is a representation of the AST we described before. After this, optimization passes are run on the LLVM bytecode, which serve the role of the optimization phase described above; each pass does one kind of optimization, and emits optimized LLVM bytecode. Then, we convert the LLVM bytecode into assembly language for the actual target language (although we can execute LLVM bytecode directly if we really want).

Thus, our model of a C compiler gets modified a little:

For each translation unit, apply whatever copy-paste operations are required by their preprocessor directive
For each translation unit, go through the front end; if all goes well, emit LLVM bytecode
For each translation unit, for each optimization pass, feed the LLVM bytecode through it and collect the output
For each translation unit, feed the LLVM bytecode through the code generator to emit the target language
Link everything together

This model is quite helpful, because thanks to the way Clang and LLVM are designed, we can 'intercept' the results from any of these steps, and change them to suit ourselves. In particular, step 3 can give us a lot of control by simply choosing which optimization passes to run or not to run. Additionally, we can inspect the LLVM bytecode at each stage to see what effect we're getting (or not getting) and whether we actually want this or not. Considering that C encourages small, tight translation units, this works in our favour, because we can easily check what our compiler and optimizer are emitting, and take appropriate actions if not. This is unlike the more monolithic model of GCC, where such interception is basically not possible. With this, we can address the problem of sufficient smartness appropriately.

Making it all work

"I get it Koz, but I didn't come here for a lecture. Show me how to fix the Sufficiently Smart Compiler problem already!"

Consider the (rather daft) C99 code below, in a file called main.c:

#include <stdlib.h>

static int loop (int x) {
  if (x) {
    loop(x - 1);
  }
  return x;
}

int main (void) {
  return loop(4);
}

The smart cookies in the audience will recognize that the loop function above can be trivially turned into a proper tail call. Let's set ourselves up so that we can have this optimization, but not any others.

We first have to get Clang to spew out LLVM bytecode. We do this by invoking the following command:

clang -emit-llvm -c main.c

Feel free to add whatever warning, standard etc. options you want here (which you should always do anyway). If all goes well, this will do all the preprocessing and front-end work, and dump an LLVM bytecode file (called main.bc) in the same directory. This bytecode file is not meant to be easily read by people; if we want to inspect it, we need to convert it into a readable bytecode file. To do that, we use llvm-dis as follows:

llvm-dis main.bc

This will emit a main.ll file, which we can have a look at with an ordinary text editor (or less if you so prefer). If you can read assembly language (of any sort), you'll feel right at home here. If not, don't worry - it's not hard, and I won't make you read it this time. However, even a cursory examination will show us that we don't have a proper tail call here, because of this:

; Function Attrs: noinline nounwind uwtable
define internal i32 @loop(i32) #0 {
; some stuff we can live without
  br i1 %4, label %5, label %9

; <label>:5:                                      ; preds = %1
; more stuff that really doesn't matter
  %8 = call i32 @loop(i32 %7)
  br label %9

; <label>:9:                                      ; preds = %5, %1
  ret i32 0
}

Clearly, we end up calling loop over and over, which will pile up on the call stack. Now, when we call loop with an argument of 4, this is a non-issue, but if we were to call it with a bigger value, it could easily overflow the call stack. So let's fix it!

The program LLVM uses to perform optimizations is opt. It can perform a large range of different optimizations - a full list is here - but the one we're most interested in is this one:

-tailcallelim: Tail Call Elimination

This file transforms calls of the current function (self recursion) followed by a return instruction with a branch to the entry of the function, creating a loop.

To apply it, we need to invoke opt on our LLVM bytecode (that's the .bc file, not the .ll file), with -tailcallelim as a flag. By default, opt will dump the result to the standard output, so we need to tell it to put the result in a file called main.opt.bc:

opt -tailcallelim -o=main.opt.bc main.bc

Let's have a look at the result, after running it through llvm-dis:

; Function Attrs: noinline nounwind uwtable
define internal i32 @loop(i32) #0 {
; some stuff we don't care about
  br label %tailrecurse

tailrecurse:                                      ; preds = %5, %1
; more stuff that doesn't really matter
  br i1 %4, label %5, label %8

; <label>:5:                                      ; preds = %tailrecurse
  %6 = load i32, i32* %2, align 4
  %7 = sub nsw i32 %6, 1
  br label %tailrecurse

; <label>:8:                                      ; preds = %tailrecurse
  ret i32 0
}

Much better! Since we now don't have calls piling up, our code will be as fast as a loop. Because we're happy with what we have, it's time to turn our LLVM bytecode into assembly language, using llc (which is the LLVM bytecode 'assembler'). We have to specify that we want an object file, rather than textual assembly - if you want to see what it assembles into, change the filetype= option to asm:

llc -filetype=obj -o main.o main.opt.bc

Now, we can link together our program. While we could do this using the linker directly, it's more convenient for us to use clang again:

clang -o main main.o

And there we have it - one beautiful binary, with only those optimizations we expressly asked for!

What optimizations are 'safe'?

The number of different optimization passes opt can do on your code, and the associated documentation, could choke a whale. While I can't really give better advice than 'read the documentation and keep your problem in mind', the following passes are pretty predictable and safe to do 'by default': dead code or store elimination and proper tail calls. In the future, I might write some examples or further stuff on this, though, so stay tuned!

"But Koz, all those commands are a real pain! What could I do to make it hurt less?"

Glad you asked! Check out tup; you won't be sorry.

Conclusion and going further

Whew, what a post! We somewhat-casually covered some of the internal workings of Clang and LLVM, compilers and the C compilation process, as well as ways to avoid the Sufficiently Smart Compiler problem. In future posts, we'll go into how we can do even better than this (this simple code is nearly a 10K binary, which is far too big).