fir_filter example

A 16-tap, real-coefficient FIR filter that processes complex (I/Q) signals. Follow along to scaffold, implement, build, and use it yourself.

TL;DR: see it work first

. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
just-makeit example fir_filter
# fir_filter: PASSED

Prerequisites

. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)

Pass a custom path to keep the venv somewhere persistent:

. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh) -- ~/my-venv

Or install with pip, let just-makeit fetch its dependencies, and activate the resulting venv:

pip install just-makeit && just-makeit install-deps
source /tmp/jm-venv/bin/activate

1. Scaffold

just-makeit new my_fir \
    --object fir_filter \
    --state "coeffs:float[16]" \
    --state "delay:float _Complex[16]" \
    --state "gain:float:1.0"

Three state variables:

Name	Type	Role	Constructor param?
`coeffs`	`float[16]`	Real tap weights	No — load via `set_coeffs()`
`delay`	`float _Complex[16]`	Complex delay line (history)	No — zero on create/reset
`gain`	`float`	Output scalar gain	Yes — default `1.0`

coeffs and delay are inline in the C struct — no heap allocation per field.
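To picture what "inline" means here, the three fields can be mirrored with ctypes. This is only a sketch: the class names are made up for illustration, the field order is assumed to follow the order of the --state flags, and the real generated struct may contain additional members or padding.

```python
import ctypes


class Complex64(ctypes.Structure):
    """Stand-in for C `float _Complex`: two adjacent 32-bit floats."""
    _fields_ = [("re", ctypes.c_float), ("im", ctypes.c_float)]


class FirFilterState(ctypes.Structure):
    """Hypothetical mirror of fir_filter_state_t (field order is assumed)."""
    _fields_ = [
        ("coeffs", ctypes.c_float * 16),  # real tap weights, stored inline
        ("delay", Complex64 * 16),        # complex delay line, stored inline
        ("gain", ctypes.c_float),         # scalar output gain
    ]


# One contiguous block: the arrays live inside the struct, not behind pointers.
print(ctypes.sizeof(FirFilterState))
# The delay line begins directly after the 16 float coefficients.
print(FirFilterState.delay.offset)
```

Under these assumptions the mirror occupies 196 bytes (64 for coeffs, 128 for delay, 4 for gain) and delay starts at byte offset 64, so a single allocation covers the whole filter state.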

2. Implement

Open native/inc/fir_filter/fir_filter_core.h and replace the fir_filter_step stub. The filter must update the delay line on every sample, so the state parameter changes from a const pointer to a mutable one:

// before
static inline float complex
fir_filter_step (const fir_filter_state_t *state, float complex x)
{
  (void)state; /* TODO: implement DSP using state variables */
  return x;
}

// after
static inline float complex
fir_filter_step (fir_filter_state_t *state, float complex x)
{
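  /* Shift delay line: the oldest sample falls off the end.
   * memmove handles the overlapping ranges; it is declared in <string.h>,
   * so add that include if the header does not already have it. */
  memmove (&state->delay[1], &state->delay[0],
           (16 - 1) * sizeof (float complex));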
  state->delay[0] = x;

  /* Convolve: y = sum_k( coeffs[k] * delay[k] ) */
  float complex y = 0.0f + 0.0f * I;
  for (int k = 0; k < 16; k++)
    y += state->coeffs[k] * state->delay[k];

  return (float complex)state->gain * y;
}

fir_filter_steps() in fir_filter_core.c loops over this automatically — no changes needed there.
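When debugging the kernel it can help to have the same recurrence in NumPy to compare against. The class below is a minimal sketch, not generated code: FirFilterModel and N_TAPS are illustrative names, and the model simply mirrors the shift, convolve, and gain stages of the C function above.

```python
import numpy as np

N_TAPS = 16  # matches the float[16] / float _Complex[16] state arrays


class FirFilterModel:
    """Pure-NumPy stand-in for the C kernel (illustrative only)."""

    def __init__(self, gain=1.0):
        self.coeffs = np.zeros(N_TAPS, dtype=np.float32)
        self.delay = np.zeros(N_TAPS, dtype=np.complex64)
        self.gain = np.float32(gain)

    def step(self, x):
        # Shift the delay line by one; the oldest sample drops off the end.
        self.delay[1:] = self.delay[:-1].copy()
        self.delay[0] = x
        # y = gain * sum_k coeffs[k] * delay[k]
        return self.gain * np.sum(self.coeffs * self.delay)

    def steps(self, xs):
        # Same role as fir_filter_steps(): call step() once per input sample.
        return np.array([self.step(x) for x in xs], dtype=np.complex64)


model = FirFilterModel(gain=1.0)
model.coeffs[:3] = [0.25, 0.5, 0.25]
impulse = np.zeros(N_TAPS, dtype=np.complex64)
impulse[0] = 1.0
response = model.steps(impulse)
print(response[:4].real)
```

Feeding a unit impulse returns the taps themselves in order, which is the quickest sanity check that the shift direction and the tap indexing agree.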

3. Build and test

make
make test

The generated tests cover getter/setter round-trips, reset behaviour, the context manager, and destroy. Once the filter is implemented you can add signal-level tests, such as the impulse-response checks shown in steps 4 and 5.
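A signal-level test can compare the filter output with NumPy's own convolution. The helper below is a sketch under stated assumptions: expected_output and check_against_reference are hypothetical names, and the check takes any callable that maps a complex64 block to a complex64 block, so it runs here against a NumPy-only stand-in rather than the compiled extension.

```python
import numpy as np


def expected_output(h, x, gain=1.0):
    """Causal FIR output: the first len(x) samples of gain * (x convolved with h)."""
    return (gain * np.convolve(x, h)[: len(x)]).astype(np.complex64)


def check_against_reference(steps_fn, h, gain=1.0, n=256, seed=0):
    """Feed random I/Q samples through steps_fn and compare with NumPy."""
    rng = np.random.default_rng(seed)
    x = (rng.standard_normal(n) + 1j * rng.standard_normal(n)).astype(np.complex64)
    np.testing.assert_allclose(
        steps_fn(x), expected_output(h, x, gain), rtol=1e-4, atol=1e-5
    )


def three_tap_stand_in(x):
    """y[n] = 0.25*x[n] + 0.5*x[n-1] + 0.25*x[n-2], starting from zero history."""
    xp = np.concatenate([np.zeros(2, dtype=x.dtype), x])
    return 0.25 * xp[2:] + 0.5 * xp[1:-1] + 0.25 * xp[:-2]


h = np.array([0.25, 0.5, 0.25] + [0.0] * 13, dtype=np.float32)
check_against_reference(three_tap_stand_in, h)  # raises AssertionError on mismatch
print("stand-in matches np.convolve")
```

With the extension installed, the steps method of a freshly created FirFilter could be passed as steps_fn after calling set_coeffs(h). That relies on the delay line starting at zero, which the state table in step 1 indicates.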

4. Try it from Python

pip install -e .

import numpy as np
from my_fir import FirFilter

f = FirFilter(gain=1.0)

# Load a 3-tap averager into the first three taps
h = np.array([0.25, 0.5, 0.25] + [0.0] * 13, dtype=np.float32)
f.set_coeffs(h)

# Inspect taps without copying — read-only view, zero allocation
view = f.get_coeffs_view()
print("writeable:", view.flags["WRITEABLE"])  # False
print("h[1]:", view[1])  # 0.5

# Feed a unit impulse and read back the impulse response
impulse = np.zeros(16, dtype=np.complex64)
impulse[0] = 1.0
y = f.steps(impulse)
print("impulse response:", y[:4].real)  # [0.25 0.5  0.25 0.  ]

# Snapshot the delay line — independent copy, safe to keep indefinitely
dl = f.get_delay()
print("delay[0]:", dl[0])

# Context manager ensures destroy() on exit
with FirFilter(gain=2.0) as g:
    g.set_coeffs(h)
    y2 = g.steps(impulse)
print("gain=2 response:", y2[:3].real)  # [0.5 1.  0.5]

5. Try it from C

After make, the combined shared library is at build/libmy_fir.so.

// demo.c
#include "fir_filter/fir_filter_core.h"
#include <complex.h>
#include <stdio.h>

int
main (void)
{
  fir_filter_state_t *f = fir_filter_create (1.0f);

  float h[16] = { 0 };
  h[0]        = 0.25f;
  h[1]        = 0.5f;
  h[2]        = 0.25f;
  fir_filter_set_coeffs (f, h);

  /* Read taps without copying — pointer valid until fir_filter_destroy(f) */
  const float *view = fir_filter_get_coeffs_view (f);
  printf ("h[1] = %.2f\n", view[1]); /* 0.50 */

  /* Feed a unit impulse */
  float complex in[16]  = { 0 };
  float complex out[16] = { 0 };
  in[0]                 = 1.0f + 0.0f * I;
  fir_filter_steps (f, in, out, 16);

  printf ("out[0]=%.2f  out[1]=%.2f  out[2]=%.2f\n", crealf (out[0]),
          crealf (out[1]), crealf (out[2])); /* 0.25  0.50  0.25 */

  /* Snapshot the delay line — independent copy */
  float _Complex dl[16];
  fir_filter_get_delay (f, dl);
  printf ("delay[0] = %.3f + %.3fj\n", crealf (dl[0]), cimagf (dl[0]));

  fir_filter_reset (f); /* clears delay and coeffs, restores gain = 1.0f */
  fir_filter_destroy (f);
  return 0;
}

gcc -O2 -std=c99 -Inative/inc demo.c \
    -Lbuild -lmy_fir -Wl,-rpath,build \
    -lm -o demo && ./demo

6. Add more state

just-makeit add --state n_taps:int32_t:16
make test

State is structural, so add rebuilds the object from the manifest: the fir_filter_state_t struct and lifecycle functions are regenerated, and your fir_filter_step() body is reset to a fresh stub. Repeat the implement step (section 2) to restore the kernel on top of the new state. The same applies when you add a larger array field, for example a 64-element double-precision complex buffer:

just-makeit add --state "coeffs64:double _Complex[64]"

7. Bonus: `--perf` + SIMD benchmark

From inside my_fir:

# run from inside my_fir
just-makeit perf

Save the benchmark below as bench.py:

import timeit
import numpy as np
from my_fir import FirFilter

BLOCK = 100_000
RUNS = 500

f = FirFilter(gain=1.0)
h = np.array([0.25, 0.5, 0.25] + [0.0] * 13, dtype=np.float32)
f.set_coeffs(h)
signal = (np.random.randn(BLOCK) + 1j * np.random.randn(BLOCK)).astype(
    np.complex64
)

elapsed = min(timeit.repeat(lambda: f.steps(signal), number=RUNS, repeat=5))
print(f"{RUNS * BLOCK / elapsed / 1e6:.1f} M complex samples/sec")

Build baseline, measure, rebuild with SIMD, measure again:

# Baseline build (no SIMD)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
    -DPython3_EXECUTABLE=$(python3 -c "import sys; print(sys.executable)")
cmake --build build --parallel
pip install -e . --force-reinstall
python3 bench.py

# Rebuild with SIMD
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DENABLE_SIMD=ON \
    -DPython3_EXECUTABLE=$(python3 -c "import sys; print(sys.executable)")
cmake --build build --parallel
pip install -e . --force-reinstall
python3 bench.py

Round 1: flags alone

The step() implementation from section 2 shifts the delay line with memmove on every sample. Building with ENABLE_SIMD=ON, which adds -march=native -ffast-math, gives a modest gain:

baseline:  106.8 M complex samples/sec
with SIMD: 154.1 M complex samples/sec   (1.4×)

The ceiling is the 120-byte memmove (15 float complex values at 8 bytes each) that runs for every input sample. The compiler can auto-vectorise the 16-tap MAC, but it cannot overlap the shift with the accumulate, because the accumulate reads the delay line that the shift has just written. Flags alone don't get you there.
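The cost of the shift is easy to put in numbers. The short calculation below is a back-of-envelope sketch that uses only the tap count and the two throughput figures quoted above; it counts bytes shifted and ignores the fact that memmove both reads and writes them.

```python
COMPLEX_BYTES = 8          # float complex = two 32-bit floats
SHIFTED_SAMPLES = 16 - 1   # memmove moves every tap except the oldest

bytes_shifted = SHIFTED_SAMPLES * COMPLEX_BYTES  # per input sample
print(f"{bytes_shifted} bytes shifted per {COMPLEX_BYTES}-byte input sample")

for label, msps in [("baseline", 106.8), ("with SIMD", 154.1)]:
    shift_gbs = bytes_shifted * msps * 1e6 / 1e9   # delay-line traffic
    input_gbs = COMPLEX_BYTES * msps * 1e6 / 1e9   # genuinely new data
    print(f"{label}: {shift_gbs:.1f} GB/s shifted vs {input_gbs:.2f} GB/s of new input")
```

Every new sample drags fifteen times its own size through the delay line, which is why removing the shift (Round 2) pays off more than compiler flags do.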

Round 2: algorithm matters

Three concerns, three places. jm_perf.h ships a JM_DEFINE_STEPS macro that stamps out the outer dispatch loop so you never write it by hand.

1. Add the constants and fir_filter_step_batch() to native/inc/fir_filter/fir_filter_core.h just after fir_filter_step():

#define FIR_TAPS 16               /* algorithm: number of coefficients  */
#define FIR_LENGTH (FIR_TAPS - 1) /* history: samples held in delay[]   */
/* JM_SIMD_WIDTH_F32 floats = JM_SIMD_WIDTH_F32/2 complex samples per batch.
 * On scalar targets (width=1) this is 0; _JM_STEPS_SIMD_ is a no-op there. */
#define FIR_BATCH (JM_SIMD_WIDTH_F32 / 2)

#if JM_SIMD_WIDTH_F32 > 1
JM_FORCEINLINE JM_HOT void
fir_filter_step_batch (fir_filter_state_t *state, const float complex *window,
                       float complex *out)
{
  JM_VEC_F32 acc = JM_ZERO_F32 ();
  for (int k = 0; k < FIR_TAPS; k++)
    JM_MAC_F32 (acc, (const float *)(window + FIR_LENGTH - k),
                state->coeffs[k]);
  JM_STORE_F32 ((float *)out, JM_MUL_F32 (acc, JM_SPLAT_F32 (state->gain)));
}
#endif

Three named constants make each concern explicit:

constant	concern	meaning
`FIR_TAPS`	algorithm	filter length (set at codegen time)
`FIR_BATCH`	parallelism	complex samples per call (`JM_SIMD_WIDTH_F32 / 2`)
`FIR_CHUNK`	tuning	samples per scratch-buffer fill

FIR_BATCH is derived from JM_SIMD_WIDTH_F32 (16 on AVX-512, 8 on AVX2), so the same source compiles to 8 or 4 complex samples per batch without any ISA-specific #ifdef. On scalar targets JM_SIMD_WIDTH_F32 = 1, so FIR_BATCH is 0 by integer division, _JM_STEPS_SIMD_ is a no-op, and step_batch() is never called.

step_batch() depends on FIR_TAPS and FIR_BATCH (the latter implicitly, through the vector width). steps() depends on all three, but you never write steps() yourself.

2. Replace fir_filter_steps in native/src/fir_filter/fir_filter_core.c:

#define FIR_CHUNK 256 /* tuning: samples per scratch-buffer fill */

JM_DEFINE_STEPS (fir_filter, fir_filter_state_t, float complex, FIR_LENGTH,
                 FIR_BATCH, FIR_CHUNK)

JM_DEFINE_STEPS, defined in jm_perf.h, generates fir_filter_steps(): it owns the scratch buffer, the chunked fill, and the scalar tail. You write step() and step_batch(); the rest is infrastructure.

baseline:   475 M complex samples/sec
with SIMD: 1745 M complex samples/sec   (3.7×)

The scalar baseline is already roughly 4.4× faster than the memmove version (475 vs 106.8 M complex samples/sec) because sequential scratch accesses are hardware-prefetcher-friendly, and the L1-resident chunk needs neither a per-sample shift nor circular-buffer index arithmetic. Building with ENABLE_SIMD=ON adds a further 3.7× from AVX-512's 16-wide float FMA; on AVX2 the 8-wide FMA path is used instead. jm_simd.h selects the best tier at compile time, so no source changes are needed.
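The Round 2 data flow can be modelled in NumPy to see why no per-sample shift is needed. This is an illustrative sketch of the scheme described above (scratch buffer, chunked fill, batched kernel, scalar tail), not the actual expansion of JM_DEFINE_STEPS. The function names and the explicit history argument are assumptions, and FIR_BATCH is fixed at 8 to mimic an AVX-512 build.

```python
import numpy as np

FIR_TAPS = 16
FIR_LENGTH = FIR_TAPS - 1   # history samples carried between chunks
FIR_CHUNK = 256             # samples copied into scratch per fill
FIR_BATCH = 8               # assumed: 16 float lanes // 2


def step_batch(coeffs, gain, window):
    """FIR_BATCH consecutive outputs; window[FIR_LENGTH] is the first new sample."""
    acc = np.zeros(FIR_BATCH, dtype=np.complex64)
    for k in range(FIR_TAPS):
        start = FIR_LENGTH - k                      # tap k reads k samples back
        acc += coeffs[k] * window[start:start + FIR_BATCH]
    return gain * acc


def steps(coeffs, gain, history, x):
    """Chunked model: scratch = [history | chunk], so nothing shifts per sample."""
    out = np.empty(len(x), dtype=np.complex64)
    scratch = np.empty(FIR_LENGTH + FIR_CHUNK, dtype=np.complex64)
    for base in range(0, len(x), FIR_CHUNK):
        chunk = x[base:base + FIR_CHUNK]
        n = len(chunk)
        scratch[:FIR_LENGTH] = history
        scratch[FIR_LENGTH:FIR_LENGTH + n] = chunk
        i = 0
        while i + FIR_BATCH <= n:                   # batched part
            out[base + i:base + i + FIR_BATCH] = step_batch(coeffs, gain, scratch[i:])
            i += FIR_BATCH
        for j in range(i, n):                       # scalar tail
            win = scratch[j:j + FIR_TAPS][::-1]     # newest sample first
            out[base + j] = gain * np.sum(coeffs * win)
        history = scratch[n:n + FIR_LENGTH].copy()  # last FIR_LENGTH samples seen
    return out, history


rng = np.random.default_rng(0)
h = rng.standard_normal(FIR_TAPS).astype(np.float32)
x = (rng.standard_normal(1003) + 1j * rng.standard_normal(1003)).astype(np.complex64)
y, tail = steps(h, 1.0, np.zeros(FIR_LENGTH, dtype=np.complex64), x)
print(np.allclose(y, np.convolve(x, h)[: len(x)], rtol=1e-4, atol=1e-4))
```

The only copying is one history block per chunk; inside a chunk every output reads a contiguous slice of scratch, which is the sequential access pattern the benchmark rewards. The window indexing in step_batch follows the same `window + FIR_LENGTH - k` offset as the C kernel.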