Performance annotations

just-makeit ships a set of compiler-hint macros and a dispatch pattern that can significantly accelerate hot DSP loops. Performance is opt-in: plain projects build and run correctly without any of it.

Enabling perf

New project:

just-makeit new my_filter --object filter --perf

Existing project (preserves all user code):

just-makeit perf

Once perf = true is recorded in just-makeit.toml, every subsequent init and add inherits it automatically.


jm_perf.h — compiler-hint macros

Macro            Effect
JM_FORCEINLINE   Forces inlining; eliminates call overhead on hot functions.
JM_HOT           Marks a function as performance-critical.
JM_LIKELY(x)     Hints that x is almost always true.
JM_UNLIKELY(x)   Hints that x is almost never true.
JM_RESTRICT      Asserts no pointer aliasing; enables freer vectorisation.
JM_ALIGNED(n)    Aligns a variable or struct field to n bytes.

All macros expand to safe no-ops on unknown compilers. On x86, jm_perf.h also includes <immintrin.h> so SIMD intrinsics are available without an extra include.
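
As an illustration, here is a minimal sketch applying the macros to a hypothetical biquad kernel (biquad_t and biquad_run are invented for this example; only the macros come from jm_perf.h):

#include "jm_perf.h"

typedef struct {
    JM_ALIGNED(64) float coeffs[5];   /* b0 b1 b2 a1 a2, 64-byte aligned */
    float s[2];                       /* transposed direct form II state */
} biquad_t;

JM_HOT void
biquad_run(biquad_t * JM_RESTRICT bq,
           const float * JM_RESTRICT in, float * JM_RESTRICT out, int n)
{
    if (JM_UNLIKELY(n <= 0))          /* degenerate call: cold path */
        return;
    for (int i = 0; i < n; i++) {
        float x  = in[i];
        float y  = bq->coeffs[0] * x + bq->s[0];
        bq->s[0] = bq->coeffs[1] * x - bq->coeffs[3] * y + bq->s[1];
        bq->s[1] = bq->coeffs[2] * x - bq->coeffs[4] * y;
        out[i]   = y;
    }
}

The JM_RESTRICT on all three pointers is what lets the compiler vectorise the loop; without it, in and out could legally overlap.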


JM_DEFINE_STEPS — the dispatch macro

JM_DEFINE_STEPS stamps out <fn>_steps() — the outer dispatch loop — so you never write it by hand.

Generic form

JM_DEFINE_STEPS(fn, state_t, sample_t, LENGTH, BATCH, CHUNK)
Parameter   Concern       Meaning
fn                        Name prefix; resolves fn##_step() and fn##_step_batch()
state_t                   State struct type
sample_t                  Per-sample type (e.g. float complex)
LENGTH      algorithm     History depth: samples held in state->delay[]
BATCH       parallelism   SIMD width in samples
CHUNK       tuning        Samples per scratch-buffer fill

LENGTH, BATCH, and CHUNK must be integer constant expressions; they size stack arrays, so VLAs are not allowed.

What gets generated:

  • On AVX-512: fills a stack scratch buffer (L1-resident, LENGTH + CHUNK samples), calls fn##_step_batch() in strides of BATCH, then falls through to the scalar tail via fn##_step().
  • Everywhere else: loops directly over fn##_step().

The three constants are the only coupling between layers. You write step(). You optionally write step_batch() for SIMD. JM_DEFINE_STEPS owns the rest.

Convention: state->delay[0..LENGTH-1] is the delay line, delay[0] = newest.
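
For orientation, the following sketch shows the shape of the code the macro stamps out. It is schematic, written with the macro's own parameter names standing in for their substitutions, and assumes CHUNK is a multiple of BATCH and at least LENGTH; the literal expansion lives in jm_perf.h:

/* Schematic only: fn, state_t, sample_t, LENGTH, BATCH, CHUNK stand in
   for the macro arguments. */
void fn_steps(state_t *state, const sample_t *in, sample_t *out, int n)
{
    int i = 0;
#ifdef __AVX512F__
    sample_t scratch[LENGTH + CHUNK];            /* stack, L1-resident */
    while (n - i >= CHUNK) {
        for (int d = 0; d < LENGTH; d++)         /* history, oldest first */
            scratch[d] = state->delay[LENGTH - 1 - d];
        for (int j = 0; j < CHUNK; j++)          /* then CHUNK fresh inputs */
            scratch[LENGTH + j] = in[i + j];
        for (int j = 0; j + BATCH <= CHUNK; j += BATCH)
            fn_step_batch(state, scratch + j, out + i + j);
        for (int d = 0; d < LENGTH; d++)         /* refresh delay, [0] = newest */
            state->delay[d] = in[i + CHUNK - 1 - d];
        i += CHUNK;
    }
#endif
    for (; i < n; i++)                           /* scalar path and tail */
        out[i] = fn_step(state, in[i]);
}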


FIR filter example

In the component header — define the constants and the SIMD kernel:

#define FIR_TAPS   16              /* algorithm:   number of coefficients       */
#define FIR_LENGTH (FIR_TAPS - 1)  /* history:     samples held in delay[]      */
#define FIR_BATCH  8               /* parallelism: AVX-512 complex samples/call */

#ifdef __AVX512F__
JM_FORCEINLINE JM_HOT void
fir_filter_step_batch(
    fir_filter_state_t     *state,
    const float complex    *window,
    float complex          *out)
{
    __m512 vg  = _mm512_set1_ps(state->gain);
    __m512 acc = _mm512_setzero_ps();
    for (int k = 0; k < FIR_TAPS; k++) {
        __m512 vx = _mm512_loadu_ps((const float *)(window + FIR_LENGTH - k));
        acc = _mm512_fmadd_ps(_mm512_set1_ps(state->coeffs[k]), vx, acc);
    }
    _mm512_storeu_ps((float *)out, _mm512_mul_ps(acc, vg));
}
#endif

FIR_LENGTH is what the macro sees — the history depth. FIR_TAPS is the FIR-specific concept: step_batch() loops over it, and the window index FIR_LENGTH - k sweeps exactly FIR_TAPS positions (index 0 = oldest history sample, index FIR_LENGTH = current sample).
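
Concretely, each step_batch() call sees the window laid out like this:

/* Window layout per fir_filter_step_batch() call (oldest first):
 *   window[0 .. FIR_LENGTH-1]                     history (the delay line)
 *   window[FIR_LENGTH .. FIR_LENGTH+FIR_BATCH-1]  the BATCH samples being computed
 */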

In the component source — tune the chunk size and generate steps():

#define FIR_CHUNK 256  /* tuning: samples per scratch-buffer fill */

JM_DEFINE_STEPS(fir_filter, fir_filter_state_t, float complex,
                FIR_LENGTH, FIR_BATCH, FIR_CHUNK)
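
The remaining piece is the scalar kernel. Its exact signature is whatever your fn##_step() resolves to; as a sketch, assuming a sample-in, sample-out form:

/* Scalar fallback and tail: one output per call (signature assumed
   for illustration). Convention: delay[0] = newest past sample. */
JM_FORCEINLINE float complex
fir_filter_step(fir_filter_state_t *state, float complex x)
{
    float complex acc = state->coeffs[0] * x;      /* tap 0: current sample */
    for (int k = 1; k < FIR_TAPS; k++)
        acc += state->coeffs[k] * state->delay[k - 1];
    for (int k = FIR_LENGTH - 1; k > 0; k--)       /* shift history */
        state->delay[k] = state->delay[k - 1];
    state->delay[0] = x;
    return acc * state->gain;
}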

See the FIR filter example for a complete walkthrough including benchmarks.


jm_simd.h — width-portable operation macros

Raw AVX intrinsics lock your step_batch() to one ISA. jm_simd.h provides macros that select the widest available instruction set at compile time — AVX-512, AVX2+FMA, or scalar — so the same source compiles everywhere.

Included automatically by jm_perf.h; can also be included standalone.

SIMD tier selection

Tier         JM_SIMD_WIDTH_F32   JM_VEC_F32   JM_VEC_F64
AVX-512F     16                  __m512       __m512d
AVX2 + FMA   8                   __m256       __m256d
Scalar       1                   float        double

JM_SIMD_WIDTH_F64 is always half of JM_SIMD_WIDTH_F32.

Operation macros

Macro                                          Operation
JM_ZERO_F32() / JM_ZERO_F64()                  Zero accumulator
JM_SPLAT_F32(x) / JM_SPLAT_F64(x)              Broadcast scalar to all lanes
JM_LOAD_F32(ptr) / JM_LOAD_F64(ptr)            Unaligned load
JM_STORE_F32(ptr, v) / JM_STORE_F64(ptr, v)    Store
JM_ADD_F32(a, b) / JM_ADD_F64(a, b)            Element-wise add
JM_MUL_F32(a, b) / JM_MUL_F64(a, b)            Element-wise multiply
JM_FMA_F32(acc, a, b) / JM_FMA_F64(...)        Fused multiply-add: acc += a * b
JM_MAC_F32(acc, ptr, s) / JM_MAC_F64(...)      Multiply-accumulate: load one vector's worth of samples from ptr, multiply by scalar s, add to acc
JM_HSUM_F32(v) / JM_HSUM_F64(v)                Horizontal sum: reduce all lanes to one scalar
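
These macros cover any element-wise kernel with the usual vector-body-plus-scalar-tail shape. A small sketch (jm_add_arrays is a hypothetical helper, not part of jm_simd.h):

#include "jm_simd.h"

/* out[i] = a[i] + b[i] for any n; compiles on every tier. */
static void jm_add_arrays(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    for (; i + JM_SIMD_WIDTH_F32 <= n; i += JM_SIMD_WIDTH_F32)
        JM_STORE_F32(out + i,
                     JM_ADD_F32(JM_LOAD_F32(a + i), JM_LOAD_F32(b + i)));
    for (; i < n; i++)        /* scalar tail */
        out[i] = a[i] + b[i];
}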

Dot product helper

float  jm_dot_f32(const float  *a, const float  *b, int n);
double jm_dot_f64(const double *a, const double *b, int n);

Handles the SIMD loop and scalar tail automatically.
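
For example, signal energy is a self dot product:

float energy = jm_dot_f32(buf, buf, n);   /* sum of buf[i] * buf[i] */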

FIR inner loop example

/* step_batch: compute JM_SIMD_WIDTH_F32 output samples per call.
   Lane j of acc accumulates output j: sum over k of coeffs[k] * window[k + j]. */
JM_FORCEINLINE JM_HOT void
fir_step_batch(fir_state_t *state, const float *window, float *out)
{
    JM_VEC_F32 acc = JM_ZERO_F32();
    for (int k = 0; k < N_TAPS; k++)
        JM_MAC_F32(acc, window + k, state->coeffs[k]);
    JM_STORE_F32(out, JM_MUL_F32(acc, JM_SPLAT_F32(state->gain)));
}

On AVX-512 the MAC loop expands to _mm512_fmadd_ps and the final line to _mm512_mul_ps plus an unaligned store. On scalar it compiles to a plain loop the compiler can auto-vectorise. No #ifdef in user code; the tier is chosen once at the top of jm_simd.h.

Additional hint macros (in jm_perf.h)

Macro                        Effect
JM_UNROLL(n)                 #pragma GCC unroll n; asks the compiler to unroll the following loop n times
JM_ASSUME_ALIGNED(ptr, n)    __builtin_assume_aligned; promises ptr is n-byte aligned, so SIMD loads pay no alignment penalty
JM_PREFETCH(ptr, rw, loc)    __builtin_prefetch; software prefetch (rw = 0 read / 1 write, loc = 0-3 temporal locality)
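
A short sketch combining the three (the 64-byte alignment, unroll factor, and prefetch distance are illustrative choices, not library requirements):

#include "jm_perf.h"

/* Sum n floats from a buffer the caller guarantees is 64-byte aligned. */
static float sum_aligned(const float *buf, int n)
{
    const float *p = JM_ASSUME_ALIGNED(buf, 64);
    float acc = 0.0f;
    JM_UNROLL(8)
    for (int i = 0; i < n; i++) {
        JM_PREFETCH(p + i + 64, 0, 3);   /* read, high temporal locality */
        acc += p[i];
    }
    return acc;
}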

SIMD build flag

SIMD intrinsics require -march=native -ffast-math. Pass -DENABLE_SIMD=ON to CMake:

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DENABLE_SIMD=ON
cmake --build build --parallel