I explain a bit about the language standards and compilers first. Feel free to jump to the examples.
I like C++. I first learned it in mid-2015 and have been using it actively ever since for various purposes, including at multiple big firms. At this point, I’m no longer surprised by how much there still is to learn about C++, or by how many surprises it keeps in store.
It provides a lot of freedom, perhaps too much for its own good. That makes it dangerously easy to write bad code or hack around problems. But that’s a topic for another blog post.
I like to keep myself up to date with the C++ language standard and new features by reading cppreference and sometimes ISO C++ proposals. However, the compilers are another thing entirely. They evolve rapidly and have their own sets of specifications and quirks.
The language standard acts like a contract that defines what happens when given
code is written while leaving the compiler free to do whatever it wants to
achieve that, as long as the observable behavior matches. For instance, signed
integer overflow is undefined behavior: if int x = INT_MAX; ++x; occurs, the
standard says nothing about what happens next; one compiler may wrap around,
another may optimize the code away entirely, and both would still be conforming.
One of the most important roles of the compiler is, thus, optimization. As long as the resulting program behaves as the written code specifies, the compiler can rearrange, remove, or add instructions to make it faster or smaller. You’d be surprised how far it goes.
Over time, C++ compilers have ended up with a list of so-called compiler optimizations that they apply to a program. For example, inlining functions means the compiler might replace a function call with the actual body of the function to avoid the overhead of the call. Loop unrolling means the compiler might duplicate the body of a loop multiple times to reduce the number of iterations and branching. And so on.
To make it easier for users to apply these optimizations, most compilers provide
optimization flags like -finline-functions or -funroll-loops. I will use GCC
and Clang as examples, but other compilers have similar flags. And these
compilers also provide optimization presets like -O0, -O1, -O2, -O3, and
-Ofast that enable a set of optimizations at once. There are also flags like
-Os and -Oz that optimize for size instead of speed.
Note: These shortcut flags are not pure presets and might end up doing more than just enabling individual flags. To quote from GCC documentation: “Not all optimizations are controlled directly by a flag.”
The higher the optimization level, the more optimizations are applied. -O0 is
the default and applies no optimizations, while -O3 applies the most
optimizations. -Ofast goes even further by enabling optimizations that might
break strict compliance with the language standard, such as assuming that
floating-point math is associative. I recommend skimming through the GCC
optimization options
here to get a sense
of the variety of optimizations available.
Now, naturally, one would expect that higher optimization levels would always
lead to better performance. And in almost all cases, they do. However, there are
some surprising cases where more optimizations actually lead to worse
performance. Even Red Hat and Fedora used to compile Python with O2 for a ~1%
speedup
(source).
Below, I will show a couple of examples where O3 performs worse than O2 in
modern compilers (as of 2025), and dig into the reasons behind it as a learning
experience. An older famous example is the
bubble sort benchmark from 10
years ago, but that’s already fixed in modern compilers.
All code is compiled and run using gcc (GCC) 15.2.1 and/or
clang version 21.1.4 on my 6.17.1-arch1-1 x86_64 system. No battery-saving
or efficiency modes are enabled. Feel free to try the examples on your own
system and compiler installation.
Example 1: Recursive Fibonacci
Consider the following code:
```cpp
#include <iostream>

int fib(int n) {
    if (n <= 1) {
        return n;
    }
    return fib(n - 2) + fib(n - 1);
}

int main() {
    std::cout << fib(42) << "\n";
    return 0;
}
```
Benchmark command:
for o in 0 1 2 3; do g++ fibo.cpp -O$o -o exe_fibo$o; done
hyperfine ./exe_fibo*
Results:
Benchmark 1: ./exe_fibo0
Time (mean ± σ): 1.537 s ± 0.002 s [User: 1.532 s, System: 0.001 s]
Range (min … max): 1.535 s … 1.540 s 10 runs
Benchmark 2: ./exe_fibo1
Time (mean ± σ): 1.033 s ± 0.006 s [User: 1.030 s, System: 0.001 s]
Range (min … max): 1.023 s … 1.040 s 10 runs
Benchmark 3: ./exe_fibo2
Time (mean ± σ): 335.0 ms ± 4.2 ms [User: 333.0 ms, System: 1.2 ms]
Range (min … max): 328.0 ms … 343.5 ms 10 runs
Benchmark 4: ./exe_fibo3
Time (mean ± σ): 378.6 ms ± 5.1 ms [User: 376.8 ms, System: 0.9 ms]
Range (min … max): 372.5 ms … 387.8 ms 10 runs
Summary
./exe_fibo2 ran
1.13 ± 0.02 times faster than ./exe_fibo3
3.08 ± 0.04 times faster than ./exe_fibo1
4.59 ± 0.06 times faster than ./exe_fibo0
As we can see, -O3 runs a significant 13% slower than -O2. Why is that?
There are usually two ways to investigate such performance surprises: looking at the generated assembly code, or selectively disabling O3-only flags to see which ones cause the slowdown. For such a small example, looking at perf reports wouldn’t be useful, as there aren’t many instructions to analyze.
Assembly Analysis
Looking at the assembly code, we can see that the -O3 version has 279
instructions while -O2 has only 247. The number of instructions is not
necessarily a good indicator of performance, but it means there’s more going on
to investigate. We also see that the only diff is inside the main function.
While -O2 simply calls fib(42), -O3 does something pretty impressive. It
runs the following sum instead and adds 1 at the end:
$$\sum_{i=0}^{40} \text{fib}(i) $$
This works because of the identity $ \text{fib}(n) = 1 + \sum_{i=0}^{n-2} \text{fib}(i) $ for $ n \geq 2 $.
This means there must be some optimization in -O3 that analyzes the recursion
and derives this formula to “speed up” the computation.
Flag Analysis
It’s certainly possible that the difference is caused by multiple flags working
together, or an implicitly enabled optimization (as mentioned earlier, -O3 is
not just a preset of flags and might do a bit more). But we can still try our
chances with single flags.
I coded a
quick script
to compile the code with -O2 and -O3 as baselines, and then added each -O3
flag one by one on top of -O2 to see if any of them causes a slowdown. The
script also checks the generated assembly to avoid unnecessary benchmarking
when the assembly is identical to either baseline.
$ python find_culprit.py fibo.cpp
Analyzing optimization flags for: fibo.cpp
Found 13 additional flags in O3 compared to O2
Flags: -fgcse-after-reload, -fipa-cp-clone, -floop-interchange, -floop-unroll-and-jam, -fpeel-loops, -fpredictive-commoning, -fsplit-loops, -fsplit-paths, -ftree-loop-distribution, -ftree-partial-pre, -funroll-completely-grow-size, -funswitch-loops, -fversion-loops-for-strides
Generating baseline assemblies...
Analyzing flags by assembly comparison...
Results:
Same assembly with O2: -fgcse-after-reload, -floop-interchange, -floop-unroll-and-jam, -fpeel-loops, -fpredictive-commoning, -fsplit-loops, -fsplit-paths, -ftree-loop-distribution, -ftree-partial-pre, -funroll-completely-grow-size, -funswitch-loops, -fversion-loops-for-strides
Same assembly with O3: None
Different assembly: -fipa-cp-clone
Compiling 3 binaries for benchmarking...
Running benchmarks...
Benchmark 1: O2
Time (mean ± σ): 283.9 ms ± 11.3 ms [User: 282.1 ms, System: 1.0 ms]
Range (min … max): 263.6 ms … 294.5 ms 11 runs
Benchmark 2: O3
Time (mean ± σ): 324.8 ms ± 15.8 ms [User: 322.9 ms, System: 1.0 ms]
Range (min … max): 306.5 ms … 344.8 ms 10 runs
Benchmark 3: O2+-fipa-cp-clone
Time (mean ± σ): 301.1 ms ± 15.5 ms [User: 298.7 ms, System: 1.4 ms]
Range (min … max): 284.0 ms … 323.0 ms 10 runs
Summary
O2 ran
1.06 ± 0.07 times faster than O2+-fipa-cp-clone
1.14 ± 0.07 times faster than O3
So we do notice that adding -fipa-cp-clone on top of -O2 causes a slowdown,
and looking at the new assembly diff, we can confirm that a derived formula
for Fibonacci is used here as well. Ironically, O3 is still slower than O2
with just this flag added, meaning the other added optimizations hurt too.
What is -fipa-cp-clone?
To quote GCC documentation:
-fipa-cp-clone
Perform function cloning to make interprocedural constant propagation stronger. When enabled, interprocedural constant propagation performs function cloning when externally visible function can be called with constant arguments. Because this optimization can create multiple copies of functions, it may significantly increase code size (see --param ipa-cp-unit-growth=value).
In simpler terms, if a function foo(int mode) is usually called with
mode = 1, the compiler using -fipa-cp-clone may create a specialised
version, e.g. foo_const_1(), where mode is treated as a constant. This allows
the compiler to remove unnecessary branches and optimise the code inside that
version for faster execution.
Our Fibonacci function is indeed called with constant values. However, don’t
be fooled into thinking performance suffers only because we use 42 as a
constant input. If you take a look at a
version where we feed n from stdin, the O3
version is still doing the sum trick and is still slower than O2.
It seems that the compiler is smart enough to simulate the fib function with
constant inputs and derive the sum formula using some recursion analysis.
-fipa-cp-clone is not solely responsible, but it seems to be the trigger
that makes the compiler do this extra work.
I’m certainly not an expert in compilers, so the above is my best guess. If you have more insights, please send me an email!
Order of recursion calls
Before moving on, one last interesting observation. Changing the return value
from fib(n-2) + fib(n-1) to fib(n-1) + fib(n-2) changes the performance as
well.
With this change, O3 gets a bit faster but O2 gets a bit slower. The
previous O2 is still the winner overall. And O3 here still uses the sum
trick but in a slightly different way. I will leave this as an exercise to the
reader to investigate!
Example 2: Tricking the compiler
In this example, I am going to play a little bit dirty. Consider the following code:
```cpp
#include <cstdint>
#include <iostream>
#include <vector>

uint32_t compute_checksum(const uint8_t* data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

int main() {
    const size_t SIZE = 1024 * 1024 * 10; // 10MB
    const int ITERATIONS = 51;
    std::vector<uint8_t> data(SIZE);
    for (size_t i = 0; i < SIZE; i++) {
        data[i] = (i * 137 + 41) % 256; // Pseudo-random pattern
    }
    uint32_t result = 0;
    for (int iter = 0; iter < ITERATIONS; iter++) {
        result ^= compute_checksum(data.data(), SIZE);
    }
    std::cout << "Checksum: " << std::hex << result << std::dec << std::endl;
    return 0;
}
```
Here I’m actually xor-ing the same checksum over and over. But xor has the
property that a ^ a = 0, so the final result is just the value of a single
compute_checksum call, since 51 is an odd number.
For the example above, neither GCC nor Clang exploits that property; both
recalculate the result on every iteration. As a result, the -O2 and -O3
versions perform similarly, though Clang is significantly faster than GCC.
Here are the benchmark results:
Benchmark 1: ./exe_gcc_checksum_O2
Time (mean ± σ): 76.1 ms ± 3.6 ms [User: 71.4 ms, System: 4.3 ms]
Range (min … max): 69.1 ms … 82.6 ms 39 runs
Benchmark 2: ./exe_gcc_checksum_O3
Time (mean ± σ): 77.6 ms ± 4.4 ms [User: 72.7 ms, System: 4.4 ms]
Range (min … max): 70.6 ms … 86.2 ms 41 runs
Benchmark 3: ./exe_clang_checksum_O2
Time (mean ± σ): 30.3 ms ± 5.1 ms [User: 25.7 ms, System: 4.2 ms]
Range (min … max): 20.7 ms … 41.3 ms 100 runs
Benchmark 4: ./exe_clang_checksum_O3
Time (mean ± σ): 30.3 ms ± 4.3 ms [User: 25.9 ms, System: 4.1 ms]
Range (min … max): 22.1 ms … 40.9 ms 93 runs
But the above checksum is pretty weak. Let’s write a more realistic version:
```cpp
uint32_t compute_checksum(const uint8_t* data, size_t len) {
    uint32_t sum1 = 1;
    uint32_t sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = data[i];
        if (byte < 32) {
            sum1 = (sum1 + byte) % 65521;
        } else if (byte < 64) {
            sum1 = (sum1 + byte * 2) % 65521;
        } else if (byte < 128) {
            sum1 = (sum1 + byte * 3) % 65521;
        } else {
            sum1 = (sum1 + byte * 4) % 65521;
        }
        sum2 = (sum2 + sum1) % 65521;
        if (i % 8 == 0) {
            sum1 ^= (sum2 << 3);
        }
        if (i % 16 == 0) {
            sum2 ^= (sum1 >> 2);
        }
    }
    return (sum2 << 16) | sum1;
}
```
Things get very interesting now. Here are the benchmark results:
Benchmark 1: ./exe_gcc_checksum_O2
Time (mean ± σ): 60.5 ms ± 1.6 ms [User: 55.8 ms, System: 4.3 ms]
Range (min … max): 58.3 ms … 66.6 ms 50 runs
Benchmark 2: ./exe_gcc_checksum_O3
Time (mean ± σ): 1.354 s ± 0.049 s [User: 1.346 s, System: 0.005 s]
Range (min … max): 1.306 s … 1.417 s 10 runs
Benchmark 3: ./exe_clang_checksum_O2
Time (mean ± σ): 1.846 s ± 0.030 s [User: 1.836 s, System: 0.005 s]
Range (min … max): 1.797 s … 1.891 s 10 runs
Benchmark 4: ./exe_clang_checksum_O3
Time (mean ± σ): 1.876 s ± 0.054 s [User: 1.865 s, System: 0.005 s]
Range (min … max): 1.800 s … 1.967 s 10 runs
The tables have turned, and in a weird direction: now GCC O2 is insanely fast,
and overall GCC is faster than Clang. What’s going on here?
Assembly & Flag Analysis
I will leave this to the interested readers as an exercise, so if you want to investigate by yourself first, stop reading here.
Explanation
This one is actually simpler than the previous example. All compilers have the
capability to figure out that the final result is just a single checksum
computation. However, when the function is inlined, the compiler naturally loses
track of that as it doesn’t see a simple result ^= compute_checksum(...) call
anymore. Instead, it tries to vectorize and optimize unrolled loops over the
large function body.
We can see that Clang is more aggressive with inlining than GCC: GCC at O2
refused to inline compute_checksum once the function got larger, which is why
it could still eliminate the redundant calls. O3 makes inlining more
aggressive for both compilers, so GCC O3 and Clang O3 end up inlining the
function and doing the full computation every time.
To confirm this, we can force-disable inlining by adding
__attribute__((noinline)) to the compute_checksum function, and now everything
is fast again:
Benchmark 1: ./exe_gcc_checksum_O2
Time (mean ± σ): 8.1 ms ± 1.8 ms [User: 3.7 ms, System: 4.3 ms]
Range (min … max): 5.4 ms … 14.3 ms 360 runs
Benchmark 2: ./exe_gcc_checksum_O3
Time (mean ± σ): 8.0 ms ± 1.5 ms [User: 3.6 ms, System: 4.3 ms]
Range (min … max): 5.3 ms … 14.6 ms 382 runs
Benchmark 3: ./exe_clang_checksum_O2
Time (mean ± σ): 6.5 ms ± 1.1 ms [User: 2.2 ms, System: 4.1 ms]
Range (min … max): 4.3 ms … 9.8 ms 384 runs
Benchmark 4: ./exe_clang_checksum_O3
Time (mean ± σ): 6.5 ms ± 1.1 ms [User: 2.1 ms, System: 4.2 ms]
Range (min … max): 4.4 ms … 10.1 ms 270 runs
Takeaways
The examples I provide above are simply interesting curiosities. In real-world code, such extreme cases are rare. However, they do highlight the complexity of modern compilers and the importance of understanding how optimization levels can affect performance in unexpected ways.
In practice, it’s always a good idea to benchmark your code with different optimization levels and use Profile Guided Optimization (PGO) for critical paths. PGO allows the compiler to optimize based on actual runtime behavior, which can lead to better performance than static optimizations alone.