Exploring fast-math in Rust: Part 0 - Introduction

Suppose I want to sum a bunch of floating-point numbers. In C that would probably be:

#include <stddef.h>

float summation(float* x, size_t len) {
    float sum = 0.0;
    for (size_t i = 0; i < len; i++) {
        sum += x[i];
    }
    return sum;
}

Except this code does not just sum the numbers, it sums them in order. Floats do not behave like real numbers. When I write code like this, the compiler assumes I am aware that floating-point addition is not associative and that I indeed want the summation done sequentially, from the first element to the last.
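
A quick demonstration of the non-associativity (in Rust, since that is where we are headed; the exact values just need to differ enough in magnitude):

fn main() {
    let (a, b, c) = (1.0e8f32, -1.0e8f32, 1.0f32);
    // (a + b) + c: the big values cancel first, so the 1.0 survives.
    assert_eq!((a + b) + c, 1.0);
    // a + (b + c): 1.0 is less than half an ulp of 1e8, so it rounds away.
    assert_eq!(a + (b + c), 0.0);
}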

The -ffast-math flag lets the compiler relax these rules a bit, which opens the door to more aggressive optimizations. Take the summation example: by allowing the compiler to assume that float addition is associative, instead of summing the numbers one by one, the compiler might decide that it’s faster to:

  • Sum the elements of x four at a time with SIMD
  • Sum the four values of the resulting vector
  • Then add in the 1–3 elements that remain (if any).1
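
To make that shape concrete, here is a hand-written Rust sketch with four independent accumulators (of course, the whole point of fast-math is that the compiler derives this, and the actual SIMD version, on its own):

fn summation_reassociated(x: &[f32]) -> f32 {
    let chunks = x.chunks_exact(4);
    let remainder = chunks.remainder();
    let mut acc = [0.0f32; 4];
    for chunk in chunks {
        // four independent partial sums - this is what maps onto one SIMD add
        for lane in 0..4 {
            acc[lane] += chunk[lane];
        }
    }
    // horizontal sum of the partial sums, then the 1-3 leftover elements
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for &v in remainder {
        sum += v;
    }
    sum
}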

And we want this in Rust. The mechanism is already in the LLVM backend, we just need to take advantage of it.

State of things

Some might note that we already have fast-math available in Rust nightly, in the form of functions in std::intrinsics:

// These should only be called on f32/f64 even though the type bound does not
// reflect this. Thankfully, rustc would throw an error if you try to call
// these with integer types.
pub fn fadd_fast<T: Copy>(a: T, b: T) -> T;
pub fn fsub_fast<T: Copy>(a: T, b: T) -> T;
pub fn fmul_fast<T: Copy>(a: T, b: T) -> T;
pub fn fdiv_fast<T: Copy>(a: T, b: T) -> T;
pub fn frem_fast<T: Copy>(a: T, b: T) -> T;
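
With these, the summation from the intro can already be written on nightly. A sketch, assuming the core_intrinsics feature gate; note that the intrinsics are unsafe to call, since the caller must uphold the fast-math assumptions (e.g. no NaNs or infinities in the inputs):

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;

fn summation(x: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &v in x {
        // caller promises the fast-math preconditions hold
        sum = unsafe { fadd_fast(sum, v) };
    }
    sum
}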

I can think of three ways of having fast-math in Rust:

  1. Create a FastFloat type that has fast-math instructions internally.
  2. Imitate what Clang does and add it via a flag.
  3. Add a #[fast_math] attribute to apply it locally to functions/statements.

These intrinsics gravitate towards the first option. And, yes, one could express a good deal of fast-math code with just these - the summation in the intro can be done with fadd_fast, as sketched above. But these cover only half of the fast-math-enabled operations in LLVM. The language reference lists the floating-point ops that may carry fast-math flags as fneg, fadd, fsub, fmul, fdiv, frem, fcmp, phi, select and call.

About adding the missing ones in the same way:

  • fneg seems straightforward to do.
  • fcmp should be fine to expose. The comparison predicate could be passed as an enum parameter (see the sketch after this list).
  • call itself is not what we need, but rather the intrinsics that have approximate variants (log, sqrt, exp, …). Code repetitiveness aside, these intrinsics could each have _fast functions added without much trouble.
  • phi and select only appear after the compiler is done parsing the code, so these instructions are available only during LLVM IR codegen. Exposing these two via a library function does not seem feasible.2
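
For fcmp, the exposed function could look something like this (a hypothetical sketch - neither the function nor the enum exists in std::intrinsics; the predicate names mirror LLVM’s):

// Hypothetical - mirrors LLVM's fcmp predicates, minus the trivial
// always-false and always-true ones.
pub enum FCmpPredicate {
    Oeq, Ogt, Oge, Olt, Ole, One, Ord, // ordered: neither operand is NaN
    Ueq, Ugt, Uge, Ult, Ule, Une, Uno, // unordered: either operand may be NaN
}

pub fn fcmp_fast<T: Copy>(a: T, b: T, pred: FCmpPredicate) -> bool;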

A minor nitpick: fast-math is really a combination of 7 different flags, and I think fadd_fast et al. should take a parameter selecting the flags we want to enable. That would let the compiler do delicious optimizations without also doing reduced-precision divides and sqrts, and/or without assuming something potentially unsafe like NaNs not existing.

// defined somewhere
bitflags! {
    #[derive(Default, Encodable, Decodable)]
    pub struct FastMathFlags: u8 {
        ...
    }
}

pub fn fadd_fast<T: Copy>(a: T, b: T, flags: FastMathFlags) -> T;
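
The seven flags in LLVM are nnan, ninf, nsz, arcp, contract, afn and reassoc, so the elided body could plausibly be (a sketch, not actual rustc code):

bitflags! {
    #[derive(Default, Encodable, Decodable)]
    pub struct FastMathFlags: u8 {
        const NNAN     = 1 << 0; // assume no NaNs
        const NINF     = 1 << 1; // assume no infinities
        const NSZ      = 1 << 2; // ignore the sign of zero
        const ARCP     = 1 << 3; // allow replacing x / y with x * (1/y)
        const CONTRACT = 1 << 4; // allow fusing ops, e.g. mul + add into fma
        const AFN      = 1 << 5; // allow approximate sqrt, sin, log, ...
        const REASSOC  = 1 << 6; // allow reassociation - our summation case
    }
}

// Reassociate the summation, but keep NaN/infinity semantics intact:
// fadd_fast(sum, v, FastMathFlags::REASSOC)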

Explosion of functions

Something more problematic with the current approach is that it won’t scale well. We also probably want fast-math for other types besides just f32 and f64; if we look at what LLVM does, we see that a fast-math call also applies to float vectors:

// excerpt from llvm/IR/Operator.h
case Instruction::Call: {
    Type *Ty = V->getType();
    while (ArrayType *ArrTy = dyn_cast<ArrayType>(Ty)) {
        Ty = ArrTy->getElementType();
    }
    return Ty->isFPOrFPVectorTy();
}

Vector types are what LLVM uses to represent SIMD types in the IR. While researching this topic, I found an example on Stack Overflow where -ffast-math changes the result of SSE intrinsics.3 If we care about raw performance, we might want to support fast-math on SIMD types as well.

Consider _mm_add_ps - an SSE intrinsic that adds two vectors of four 32-bit floats (__m128) element-wise. In Rust, this is internally a call to simd_add, which is just fadd - meaning it could be modified by a fast-math flag. An _mm_add_ps_fast is one thing, but we have multitudes of intrinsics for multitudes of architectures; a _fast version for each one of them is certainly doable but doesn’t smell like good design to me.
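
Roughly, the stdarch side looks like this (a simplified sketch of core::arch internals; the real code has a few more layers):

#![feature(platform_intrinsics)]
use core::arch::x86_64::__m128;

extern "platform-intrinsic" {
    // platform-generic vector add; codegens to a single LLVM fadd
    fn simd_add<T>(a: T, b: T) -> T;
}

pub unsafe fn _mm_add_ps(a: __m128, b: __m128) -> __m128 {
    simd_add(a, b) // lowers to one fadd <4 x float> in the IR
}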

Conclusion

Let me just say that I am by no means against a FastFloat type. I just think a proper one would need more assistance from the compiler and should not rely on the _fast functions. That probably requires changes in the Rust HIR and/or MIR and cannot be done with just the information available to the LLVM IR backend.

Since implementing a fast-math type right now would not fully capture what LLVM is capable of, I’m going to explore the other two options in this series of posts - fast-math via a flag and via an attribute. They look easier to implement. The next part will cover adding a -Z flag.


  1. And that is actually what Clang produces - add four at a time with fadd fast <4 x float>, then horizontally sum the SIMD vector with @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32. ↩︎

  2. Might not matter much. LLVM is smart enough to derive phi and select from just br’s. And while there are a couple of usages of select in the LLVM backend code, I only found a total of two places where phi was used. ↩︎

  3. This is actually fcmp, with Clang translating _mm_cmpord_ps directly to LLVM IR. Rust calls the llvm.x86.sse.cmp.ps intrinsic instead. ↩︎