fir_filter example
A 16-tap, real-coefficient FIR filter that processes complex (I/Q) signals. Follow along to scaffold, implement, build, and use it yourself.
TL;DR — see it work first
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
just-makeit example fir_filter
# fir_filter: PASSED
Prerequisites
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
Pass a custom path to keep the venv somewhere persistent:
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh) -- ~/my-venv
Or with pip if just-makeit is already installed:
pip install just-makeit && just-makeit install-deps
source /tmp/jm-venv/bin/activate
1. Scaffold
just-makeit new my_fir \
--object fir_filter \
--state "coeffs:float[16]" \
--state "delay:float _Complex[16]" \
--state "gain:float:1.0"
Three state variables:
| Name | Type | Role | Constructor param? |
|---|---|---|---|
| coeffs | float[16] | Real tap weights | No, load via set_coeffs() |
| delay | float _Complex[16] | Complex delay line (history) | No, zero on create/reset |
| gain | float | Output scalar gain | Yes, default 1.0 |
coeffs and delay are inline in the C struct — no heap allocation per field.
2. Implement
Open native/inc/fir_filter/fir_filter_core.h and replace the fir_filter_step stub.
The filter must update the delay line, so the signature changes from const to mutable:
// before
static inline float complex fir_filter_step(const fir_filter_state_t *state, float complex x) {
(void)state; /* TODO: implement DSP using state variables */
return x;
}
// after
static inline float complex fir_filter_step(fir_filter_state_t *state, float complex x) {
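/* Shift delay line; the oldest sample falls off the end.
 * memmove is declared in <string.h>: include it at the top of this header if it is not already. */
memmove(&state->delay[1], &state->delay[0], (16 - 1) * sizeof(float complex));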
state->delay[0] = x;
/* Convolve: y = sum_k( coeffs[k] * delay[k] ) */
float complex y = 0.0f + 0.0f * I;
for (int k = 0; k < 16; k++)
y += state->coeffs[k] * state->delay[k];
return (float complex)state->gain * y;
}
fir_filter_steps() in fir_filter_core.c loops over this automatically —
no changes needed there.
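When checking the C implementation, a slow reference model can be handy. The following is a minimal pure-Python sketch of the same shift-then-convolve logic; `FirModel` and its method names are hypothetical and are not part of the generated package.

```python
class FirModel:
    """Pure-Python stand-in for fir_filter_step (hypothetical helper, not generated code)."""

    TAPS = 16

    def __init__(self, gain=1.0):
        self.coeffs = [0.0] * self.TAPS      # real tap weights
        self.delay = [0j] * self.TAPS        # complex history, newest sample first
        self.gain = gain

    def step(self, x):
        # Shift the delay line by one; the oldest sample falls off the end.
        self.delay = [complex(x)] + self.delay[:-1]
        # Convolve: y = sum_k coeffs[k] * delay[k], then apply the output gain.
        y = sum(c * d for c, d in zip(self.coeffs, self.delay))
        return self.gain * y

    def steps(self, xs):
        return [self.step(x) for x in xs]


model = FirModel(gain=1.0)
model.coeffs[:3] = [0.25, 0.5, 0.25]
impulse = [1.0] + [0.0] * 15
response = model.steps(impulse)
print([y.real for y in response[:4]])  # [0.25, 0.5, 0.25, 0.0]
```

After the 16-sample impulse block, the single non-zero input has travelled to the last slot of the model's delay line, which is the behaviour the memmove shift is meant to produce.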
3. Build and test
make
make test
The generated tests cover getter/setter round-trips, reset behaviour, the context manager, and destroy. After implementing the filter you can add signal-level tests; the impulse-response checks in steps 4 and 5 are a starting point.
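A signal-level test can compare the filter's output against a direct evaluation of the convolution sum. A minimal sketch, assuming the filter is exposed as a callable that maps a block of samples to a block of outputs; `direct_fir`, `check_against_reference`, and the stand-in `fake_steps` are hypothetical names used only for illustration.

```python
def direct_fir(h, xs, gain=1.0):
    """Reference output: y[n] = gain * sum_k h[k] * x[n-k], with zero initial history."""
    ys = []
    for n in range(len(xs)):
        acc = 0j
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * xs[n - k]
        ys.append(gain * acc)
    return ys


def check_against_reference(steps, h, xs, gain=1.0, tol=1e-6):
    """Raise AssertionError if steps(xs) deviates from the reference by more than tol."""
    got = list(steps(xs))
    want = direct_fir(h, xs, gain)
    assert len(got) == len(want), "output length mismatch"
    for n, (g, w) in enumerate(zip(got, want)):
        assert abs(complex(g) - w) <= tol, f"sample {n}: got {g}, want {w}"
    return True


h = [0.25, 0.5, 0.25] + [0.0] * 13
xs = [complex(n % 5, -(n % 3)) for n in range(40)]


def fake_steps(block):
    # Stand-in for a freshly created filter with gain 2.0; not the real extension.
    return direct_fir(h, block, gain=2.0)


ok = check_against_reference(fake_steps, h, xs, gain=2.0)
print(ok)  # True
```

With the built extension, the same helper could presumably be called with the `steps` method of a fresh `FirFilter` after `set_coeffs(h)`, passing the samples as a complex64 array; the tolerance may need adjusting for single-precision arithmetic.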
4. Try it from Python
pip install -e .
import numpy as np
from my_fir import FirFilter
f = FirFilter(gain=1.0)
# Load a 3-tap averager into the first three taps
h = np.array([0.25, 0.5, 0.25] + [0.0] * 13, dtype=np.float32)
f.set_coeffs(h)
# Inspect taps without copying — read-only view, zero allocation
view = f.get_coeffs_view()
print("writeable:", view.flags["WRITEABLE"]) # False
print("h[1]:", view[1]) # 0.5
# Feed a unit impulse and read back the impulse response
impulse = np.zeros(16, dtype=np.complex64)
impulse[0] = 1.0
y = f.steps(impulse)
print("impulse response:", y[:4].real) # [0.25 0.5 0.25 0. ]
# Snapshot the delay line — independent copy, safe to keep indefinitely
dl = f.get_delay()
print("delay[0]:", dl[0])
# Context manager ensures destroy() on exit
with FirFilter(gain=2.0) as g:
g.set_coeffs(h)
y2 = g.steps(impulse)
print("gain=2 response:", y2[:3].real) # [0.5 1. 0.5]
5. Try it from C
After make, the combined shared library is at build/libmy_fir.so.
// demo.c
#include "fir_filter/fir_filter_core.h"
#include <complex.h>
#include <stdio.h>
int main(void) {
fir_filter_state_t *f = fir_filter_create(1.0f);
float h[16] = {0};
h[0] = 0.25f;
h[1] = 0.5f;
h[2] = 0.25f;
fir_filter_set_coeffs(f, h);
/* Read taps without copying — pointer valid until fir_filter_destroy(f) */
const float *view = fir_filter_get_coeffs_view(f);
printf("h[1] = %.2f\n", view[1]); /* 0.50 */
/* Feed a unit impulse */
float complex in[16] = {0};
float complex out[16] = {0};
in[0] = 1.0f + 0.0f * I;
fir_filter_steps(f, in, out, 16);
printf("out[0]=%.2f out[1]=%.2f out[2]=%.2f\n", crealf(out[0]), crealf(out[1]),
crealf(out[2])); /* 0.25 0.50 0.25 */
/* Snapshot the delay line — independent copy */
float _Complex dl[16];
fir_filter_get_delay(f, dl);
printf("delay[0] = %.3f + %.3fj\n", crealf(dl[0]), cimagf(dl[0]));
fir_filter_reset(f); /* clears delay and coeffs, restores gain = 1.0f */
fir_filter_destroy(f);
return 0;
}
gcc -O2 -std=c99 -Inative/inc demo.c \
-Lbuild -lmy_fir -Wl,-rpath,build \
-lm -o demo && ./demo
6. Add more state
just-makeit add --state n_taps:int32_t:16
make test
Or swap in a longer delay line without touching your implementation:
just-makeit add --state "coeffs64:double _Complex[64]"
7. Bonus: --perf + SIMD benchmark
From inside my_fir:
# run from inside my_fir
just-makeit perf
Save the benchmark below as bench.py:
import timeit
import numpy as np
from my_fir import FirFilter
BLOCK = 100_000
RUNS = 500
f = FirFilter(gain=1.0)
h = np.array([0.25, 0.5, 0.25] + [0.0] * 13, dtype=np.float32)
f.set_coeffs(h)
signal = (np.random.randn(BLOCK) + 1j * np.random.randn(BLOCK)).astype(np.complex64)
elapsed = min(timeit.repeat(lambda: f.steps(signal), number=RUNS, repeat=5))
print(f"{RUNS * BLOCK / elapsed / 1e6:.1f} M complex samples/sec")
Build baseline, measure, rebuild with SIMD, measure again:
# Baseline build (no SIMD)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
-DPython3_EXECUTABLE=$(python3 -c "import sys; print(sys.executable)")
cmake --build build --parallel
pip install -e . --force-reinstall
python3 bench.py
# Rebuild with SIMD
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DENABLE_SIMD=ON \
-DPython3_EXECUTABLE=$(python3 -c "import sys; print(sys.executable)")
cmake --build build --parallel
pip install -e . --force-reinstall
python3 bench.py
Round 1 — flags alone
The implementation from step 2 shifts the delay line with memmove on every sample.
Adding -march=native -ffast-math via ENABLE_SIMD=ON gives a modest gain:
baseline: 106.8 M complex samples/sec
with SIMD: 154.1 M complex samples/sec (1.4×)
The ceiling is the memmove of 120 bytes (15 float complex) that runs every
sample. The vectoriser can auto-vectorise the 16-tap MAC, but it can't overlap
that store with the accumulate. Flags alone don't get you there.
Round 2 — algorithm matters
Three concerns, three places. jm_perf.h ships a JM_DEFINE_STEPS macro
that stamps out the outer dispatch loop so you never write it by hand.
1. Add the constants and fir_filter_step_batch() to
native/inc/fir_filter/fir_filter_core.h just after fir_filter_step():
#define FIR_TAPS 16 /* algorithm: number of coefficients */
#define FIR_LENGTH (FIR_TAPS - 1) /* history: samples held in delay[] */
/* JM_SIMD_WIDTH_F32 floats = JM_SIMD_WIDTH_F32/2 complex samples per batch.
* On scalar targets (width=1) this is 0; _JM_STEPS_SIMD_ is a no-op there. */
#define FIR_BATCH (JM_SIMD_WIDTH_F32 / 2)
#if JM_SIMD_WIDTH_F32 > 1
JM_FORCEINLINE JM_HOT void
fir_filter_step_batch(
fir_filter_state_t *state,
const float complex *window,
float complex *out)
{
JM_VEC_F32 acc = JM_ZERO_F32();
for (int k = 0; k < FIR_TAPS; k++)
JM_MAC_F32(acc, (const float *)(window + FIR_LENGTH - k), state->coeffs[k]);
JM_STORE_F32((float *)out, JM_MUL_F32(acc, JM_SPLAT_F32(state->gain)));
}
#endif
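The indexing in fir_filter_step_batch() is easier to follow in scalar form. The sketch below assumes that `JM_MAC_F32(acc, ptr, c)` accumulates `c` times the vector loaded from `ptr`; that reading is inferred from how the macro is used here, not taken from jm_simd.h. Under that assumption each complex lane `j` computes one output from a window holding FIR_LENGTH history samples followed by the new samples. `step_batch_model` is a hypothetical name.

```python
FIR_TAPS = 16
FIR_LENGTH = FIR_TAPS - 1


def step_batch_model(coeffs, gain, window, batch):
    """Scalar model of one batch: lane j reads window[FIR_LENGTH - k + j] for tap k."""
    out = []
    for j in range(batch):                     # one complex output per pair of float lanes
        acc = 0j
        for k in range(FIR_TAPS):
            acc += coeffs[k] * window[FIR_LENGTH - k + j]
        out.append(gain * acc)
    return out


coeffs = [0.25, 0.5, 0.25] + [0.0] * 13
history = [0j] * FIR_LENGTH                    # zeroed delay line, oldest sample first
new = [1 + 0j, 0j, 0j, 0j]                     # unit impulse, batch of 4
out = step_batch_model(coeffs, 1.0, history + new, batch=4)
print([y.real for y in out])  # [0.25, 0.5, 0.25, 0.0]
```

The window must hold at least FIR_LENGTH + batch samples, since lane `batch - 1` with tap 0 reads index FIR_LENGTH + batch - 1.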
Three named constants make each concern explicit:
| constant | concern | meaning |
|---|---|---|
| FIR_TAPS | algorithm | filter length (set at codegen time) |
| FIR_BATCH | parallelism | complex samples per call (JM_SIMD_WIDTH_F32 / 2) |
| FIR_CHUNK | tuning | samples per scratch-buffer fill |
FIR_BATCH is derived from JM_SIMD_WIDTH_F32 (16 on AVX-512, 8 on AVX2),
so the same source compiles to 8 or 4 complex samples per batch without any
#ifdef. On scalar targets JM_SIMD_WIDTH_F32 = 1, _JM_STEPS_SIMD_ is a
no-op, and step_batch() is never called.
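The mapping from vector width to batch size is plain integer division, which is why the scalar case lands on zero. A quick check using the widths quoted in this section (`fir_batch` is a hypothetical helper name):

```python
def fir_batch(simd_width_f32):
    """Complex samples per batch: two floats (re, im) per sample, truncating as C does."""
    return simd_width_f32 // 2


for name, width in [("AVX-512", 16), ("AVX2", 8), ("scalar", 1)]:
    print(f"{name}: {fir_batch(width)}")
# AVX-512: 8
# AVX2: 4
# scalar: 0
```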
step_batch() uses FIR_TAPS (and FIR_LENGTH, which is derived from it) and produces
FIR_BATCH outputs per call. steps() uses all three, but you never write steps().
2. Replace fir_filter_steps in native/src/fir_filter/fir_filter_core.c:
#define FIR_CHUNK 256 /* tuning: samples per scratch-buffer fill */
JM_DEFINE_STEPS(fir_filter, fir_filter_state_t, float complex,
FIR_LENGTH, FIR_BATCH, FIR_CHUNK)
JM_DEFINE_STEPS generates fir_filter_steps() from the macro in jm_perf.h:
it owns the scratch buffer, the chunked fill, and the scalar tail. You write
step(). You write step_batch(). The rest is infrastructure.
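The scratch-buffer idea can be sketched in a few lines. This illustrates the general technique (history prefix, chunked fill, carry-over) and is not the expansion of JM_DEFINE_STEPS; the function name `steps_chunked`, the explicit `history` argument, and the scalar inner loop are assumptions made for the example.

```python
FIR_TAPS = 16
FIR_LENGTH = FIR_TAPS - 1
FIR_CHUNK = 256


def steps_chunked(coeffs, gain, history, xs):
    """Filter xs in chunks; history holds the previous FIR_LENGTH inputs, oldest first."""
    scratch = list(history) + [0j] * FIR_CHUNK            # layout: [history | chunk]
    out = []
    for start in range(0, len(xs), FIR_CHUNK):
        chunk = xs[start:start + FIR_CHUNK]
        n = len(chunk)
        scratch[FIR_LENGTH:FIR_LENGTH + n] = chunk        # chunked fill
        for j in range(n):                                # sequential reads, no shifting
            acc = 0j
            for k in range(FIR_TAPS):
                acc += coeffs[k] * scratch[FIR_LENGTH + j - k]
            out.append(gain * acc)
        scratch[:FIR_LENGTH] = scratch[n:n + FIR_LENGTH]  # carry the newest samples over
    return out, scratch[:FIR_LENGTH]


coeffs = [0.25, 0.5, 0.25] + [0.0] * 13
xs = [complex(n % 7, n % 4) for n in range(600)]          # spans three chunks

# Reference: direct convolution sum with zero initial history.
ref = [sum(coeffs[k] * xs[n - k] for k in range(FIR_TAPS) if n - k >= 0)
       for n in range(len(xs))]

out, hist = steps_chunked(coeffs, 1.0, [0j] * FIR_LENGTH, xs)
print(all(abs(a - b) < 1e-12 for a, b in zip(out, ref)))  # True
```

The history is copied once per chunk rather than once per sample, so the shifting cost is spread over FIR_CHUNK outputs. A vectorised version would replace the inner `j` loop with batch calls and keep a scalar loop for the leftover samples.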
baseline: 475 M complex samples/sec
with SIMD: 1745 M complex samples/sec (3.7×)
The scalar baseline is already about 4.4× faster than the memmove version
(475 vs 106.8 M complex samples/sec) because sequential scratch accesses are
hardware-prefetcher-friendly and the L1-resident chunk needs no circular-buffer
index arithmetic. Adding ENABLE_SIMD=ON delivers the full speedup from AVX-512's
16-wide float FMA (3.7×) or AVX2's 8-wide FMA; jm_simd.h selects the best tier
at compile time, no source changes needed.
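The figures quoted in both rounds can be re-derived from the reported throughputs and the sizes involved. The snippet below only repeats numbers already given in this section; the variable names are illustrative.

```python
SIZEOF_FLOAT_COMPLEX = 8                      # two 32-bit floats
memmove_bytes = (16 - 1) * SIZEOF_FLOAT_COMPLEX
print(memmove_bytes)                          # 120

round1_base, round1_simd = 106.8, 154.1       # M complex samples/sec, memmove version
round2_base, round2_simd = 475.0, 1745.0      # M complex samples/sec, scratch-buffer version

print(f"{round1_simd / round1_base:.1f}")     # 1.4  (flags alone)
print(f"{round2_base / round1_base:.1f}")     # 4.4  (algorithm change, scalar build)
print(f"{round2_simd / round2_base:.1f}")     # 3.7  (SIMD on top of the new algorithm)
```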