sliding_power — sliding window signal power estimator
Estimate the instantaneous power of a signal over a rolling window of N samples:
P[n] = (1/N) * sum( |x[n-k]|^2 for k in 0..N-1 )
Two update strategies are shown (a minimal standalone sketch of both follows just below):
- step() — O(1) recursive: sum_sq += new² − old²
- SIMD recompute — horizontally sums the delay line with JM_ADD_F32/JM_HSUM_F32 from jm_simd.h; used for periodic recalibration and as a clean demonstration of the v0.5 macro set.
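Both strategies compute the same quantity. Before the scaffolded, complex-valued version is built step by step below, here is a standalone C sketch of the two on real-valued samples (illustrative names only, not scaffold API):
/* Illustrative sketch only: both update strategies on real-valued samples.
 * The scaffolded object built below does the same for complex samples. */
#include <stdio.h>

#define N 64

static float  delay[N];  /* |x|^2 of the last N samples */
static double sum_sq;    /* running window total        */
static int    pos;       /* write head                  */

/* Strategy 1: O(1) recursive update */
static float step(float x)
{
    float mag_sq = x * x;
    sum_sq += (double)mag_sq - (double)delay[pos];
    delay[pos] = mag_sq;
    pos = (pos + 1) % N;
    return (float)(sum_sq / N);
}

/* Strategy 2: O(N) full recompute of the same window */
static float recompute(void)
{
    double acc = 0.0;
    for (int k = 0; k < N; k++)
        acc += delay[k];
    return (float)(acc / N);
}

int main(void)
{
    float p = 0.0f;
    for (int n = 0; n < 256; n++)
        p = step((n % 2) ? 1.0f : -1.0f);  /* +/-1 square wave: power 1 */
    printf("recursive: %.4f  recompute: %.4f\n", p, recompute());
    return 0;
}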
TL;DR — see it work first
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
just-makeit example sliding_power
# sliding_power: PASSED
Prerequisites
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
Pass a custom path to keep the venv somewhere persistent:
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh) -- ~/my-venv
Or with pip if just-makeit is already installed:
pip install just-makeit && just-makeit install-deps
source /tmp/jm-venv/bin/activate
1. Scaffold
The state has three fields: a 64-element float ring buffer (delay) that
stores |x|² for the last 64 samples, a running double accumulator (sum_sq),
and a write-head index (pos). The --perf flag generates jm_perf.h and
jm_simd.h alongside the scaffold.
just-makeit new my_power \
--object power_est \
--state "delay:float[64]" \
--state "sum_sq:double:0.0" \
--state "pos:int:0" \
--perf
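From those --state flags, the generated state struct should look roughly like this (an illustrative sketch; the scaffolder's output is authoritative and its exact layout may differ):
/* Plausible shape of the generated state (sketch, not generated output) */
typedef struct {
    float  delay[64];  /* |x|^2 ring buffer for the last 64 samples */
    double sum_sq;     /* running window total, initialised to 0.0  */
    int    pos;        /* write-head index, initialised to 0        */
} power_est_state_t;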
2. Implement step()
Replace the generated stub in native/inc/power_est/power_est_core.h with
the recursive O(1) update. The delay line stores |x|² for each past sample;
sum_sq is the running total.
JM_FORCEINLINE JM_HOT float
power_est_step(power_est_state_t *state, float complex x)
{
    float re = crealf(x), im = cimagf(x);
    float mag_sq = re * re + im * im;

    /* O(1) recursive update: subtract the oldest sample, add the new one.
     * Form the difference in double so the accumulator absorbs the float32
     * rounding (see the numerical notes below). */
    state->sum_sq += (double)mag_sq - (double)state->delay[state->pos];
    state->delay[state->pos] = mag_sq;
    state->pos = (state->pos + 1) & 63; /* window = 64 = 2^6 */

    return (float)(state->sum_sq * (1.0 / 64.0));
}
Apply the patch:
python3 .steps/02_patch.py
3. Build and test
make && make test
The generated C test exercises create, reset, step, and steps.
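For a quick hand-rolled check outside the generated harness, something along these lines works, assuming zero-initialising power_est_state_t matches the scaffold defaults and that native/inc is on the include path (both are assumptions, not scaffold guarantees):
/* quick_check.c: illustrative sanity check, not the generated test.      */
#include <assert.h>
#include <complex.h>
#include <math.h>
#include <stdio.h>

#include "power_est/power_est_core.h"  /* assumed include path */

int main(void)
{
    power_est_state_t st = {0};        /* matches the declared defaults */
    const float two_pi = 6.2831853f;
    float p = 0.0f;

    /* Amplitude-1 sine, period 16: power over a full 64-sample window is 0.5 */
    for (int n = 0; n < 128; n++)
        p = power_est_step(&st, sinf(two_pi * n / 16.0f) + 0.0f * I);

    printf("sine power: %.4f\n", p);
    assert(fabsf(p - 0.5f) < 1e-3f);
    return 0;
}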
4. Python demo
After pip install -e .:
"""Power estimator demo.
Run from the project root after `pip install -e .`:
python3 .steps/04_demo.py
"""
import math

import numpy as np

from my_power import PowerEst

est = PowerEst()

# --- sine wave: expected power = 0.5 (amplitude 1, RMS = 1/sqrt(2)) --------
for n in range(128):
    y = est.step(math.sin(2 * math.pi * n / 16) + 0j)
print(f"sine power (expect ~0.500): {y.real:.4f}")

# --- white noise: expected power ≈ variance --------------------------------
rng = np.random.default_rng(42)
noise = (rng.standard_normal(128) + 1j * rng.standard_normal(128)) / math.sqrt(2)
for x in noise:
    y = est.step(complex(x))
print(f"noise power (expect ~1.000): {y.real:.4f}")

# --- silence: power decays to zero after 64 samples -----------------------
for _ in range(64):
    y = est.step(0j)
print(f"silence power (expect 0.000): {y.real:.4f}")

# --- steps() on a block ---------------------------------------------------
est.reset()
block = np.array([math.sin(2 * math.pi * n / 16) for n in range(128)],
                 dtype=np.complex64)
out = est.steps(block)
print(f"steps() final power (expect ~0.500): {out[-1].real:.4f}")
Expected output:
sine power (expect ~0.500): 0.5000
noise power (expect ~1.000): 0.9844
silence power (expect 0.000): 0.0000
steps() final power (expect ~0.500): 0.5000
5. SIMD recompute with jm_simd.h
The recursive step() accumulates sum_sq incrementally — fast, but
floating-point rounding errors accumulate over millions of samples. A
periodic full recompute from the delay line corrects any drift.
Add this function to native/src/power_est/power_est_core.c (it needs
jm_simd.h, included via jm_perf.h):
/* Add to power_est_core.c — SIMD recompute from delay line.
 *
 * Uses jm_simd.h macros: JM_ADD_F32, JM_LOAD_F32, JM_HSUM_F32, JM_UNROLL.
 * Compiles to AVX-512, AVX2, or scalar depending on -march flags.
 *
 * Call every ~1000 samples to correct floating-point drift in sum_sq.
 */
static inline float
power_est_recompute(power_est_state_t *state)
{
    JM_VEC_F32 acc = JM_ZERO_F32();

    JM_UNROLL(4)
    for (int k = 0; k < 64; k += JM_SIMD_WIDTH_F32)
        acc = JM_ADD_F32(acc, JM_LOAD_F32(state->delay + k));

    float power = JM_HSUM_F32(acc) * (1.0f / 64.0f);

    /* Sync the double accumulator so both paths agree */
    state->sum_sq = (double)power * 64.0;
    return power;
}
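A plausible calling pattern for the "every ~1000 samples" recalibration mentioned in the comment above; process_block, CAL_PERIOD, and the static counter are illustrative names, not part of the scaffold (assumes <stddef.h> and <complex.h> are already included in power_est_core.c):
/* Illustrative driver: O(1) steps with a periodic O(N) recalibration.   */
#define CAL_PERIOD 1000

static void
process_block(power_est_state_t *state, const float complex *x, size_t n,
              float *power_out)
{
    static size_t since_cal = 0;  /* illustrative; a real driver would own this */

    for (size_t i = 0; i < n; i++) {
        power_out[i] = power_est_step(state, x[i]);  /* O(1) per sample  */
        if (++since_cal >= CAL_PERIOD) {
            power_est_recompute(state);              /* corrects sum_sq  */
            since_cal = 0;
        }
    }
}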
What each macro does on each ISA:
| Macro | AVX-512 | AVX2 | Scalar |
|---|---|---|---|
| JM_VEC_F32 | __m512 (16 lanes) | __m256 (8 lanes) | float (1 lane) |
| JM_LOAD_F32(p) | _mm512_loadu_ps(p) | _mm256_loadu_ps(p) | *(p) |
| JM_ADD_F32(a,b) | _mm512_add_ps(a,b) | _mm256_add_ps(a,b) | (a)+(b) |
| JM_HSUM_F32(v) | _mm512_reduce_add_ps(v) | _mm_hadd_ps(...) | (v) |
| JM_UNROLL(4) | #pragma GCC unroll 4 | same | same |
The loop body is identical across all three tiers. JM_SIMD_WIDTH_F32
(16, 8, or 1) controls the stride; the 64-element delay line is always
an exact multiple of any supported width, so there is no scalar tail.
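To make the tier equivalence concrete, this is roughly what the loop reduces to on the scalar tier once the scalar column of the table is substituted (a hand expansion for illustration, not generated code):
/* Scalar-tier expansion of power_est_recompute(), by hand:
 * JM_SIMD_WIDTH_F32 == 1, so every macro collapses to a plain float op. */
static inline float
power_est_recompute_scalar(const float delay[64])
{
    float acc = 0.0f;               /* JM_VEC_F32 acc = JM_ZERO_F32();   */
    for (int k = 0; k < 64; k += 1) /* stride = JM_SIMD_WIDTH_F32        */
        acc = acc + delay[k];       /* JM_ADD_F32(acc, JM_LOAD_F32(..))  */
    return acc * (1.0f / 64.0f);    /* JM_HSUM_F32(acc) is the identity  */
}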
Build with -DENABLE_SIMD=ON to activate AVX-512 or AVX2 paths:
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DENABLE_SIMD=ON
cmake --build build --parallel
Numerical notes
The three approaches compute the same quantity — they differ only in how they maintain the running sum.
| Approach | Cost per sample | Drift |
|---|---|---|
| Standard MA (full recompute) | O(N) adds | none — always exact |
| Recursive O(1), float32 acc | O(1) | grows as √n |
| Recursive O(1), double acc | O(1) | negligible |
| Recursive + periodic recompute | O(1) amortised | bounded |
Why drift happens in the recursive form:
Each step() does sum_sq += new² − old². On paper this is exact, but in
floating-point the subtraction can cancel significant bits, and the residual
error accumulates. Over n processed samples the error grows roughly as
sqrt(n) * eps * sum_sq (n is the sample count, not the window length N).
Why it barely matters with a double accumulator:
The delay line holds float values (4-byte, ~7 decimal digits). The
accumulator is double (8-byte, ~15 digits). The inputs only carry 7
digits of information, so the accumulator's extra precision absorbs all
rounding residuals. Over 5 million noise samples the error stays at
~10⁻¹⁵ — below the float32 quantisation noise floor of ~10⁻⁷.
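A dependency-free sketch, independent of the scaffold and of .steps/06_compare.py, makes both behaviours visible (rand() stands in for |x|²; exact figures will differ from the output shown further down):
/* drift_demo.c: standalone sketch comparing a float32 vs a double running
 * accumulator, each checked against a full recompute of the same window.
 * Build: cc -O2 drift_demo.c -o drift_demo && ./drift_demo               */
#include <stdio.h>
#include <stdlib.h>

#define N 64

int main(void)
{
    float  delay[N] = {0};
    float  sum_f = 0.0f;  /* float32 accumulator: drifts                 */
    double sum_d = 0.0;   /* double accumulator: drift stays negligible  */

    for (long n = 1; n <= 5000000; n++) {
        float mag_sq = (float)rand() / (float)RAND_MAX;  /* stand-in |x|^2 */
        int   pos    = (int)(n % N);

        sum_f += mag_sq - delay[pos];                  /* float difference  */
        sum_d += (double)mag_sq - (double)delay[pos];  /* double difference */
        delay[pos] = mag_sq;

        if (n % 1000000 == 0) {
            double exact = 0.0;                        /* full recompute    */
            for (int k = 0; k < N; k++)
                exact += (double)delay[k];
            printf("n=%7ld  err(float32)=%9.2e  err(double)=%9.2e\n",
                   n, ((double)sum_f - exact) / N, (sum_d - exact) / N);
        }
    }
    return 0;
}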
When the SIMD recompute earns its keep:
If the accumulator were float (e.g., for an embedded target with no FPU
double path), drift grows to ~10⁻⁵ by 5M samples and keeps climbing. A
full recompute every 1000 samples (on AVX-512, just four 16-lane vector
adds plus one horizontal sum) resets the error to zero.
You can reproduce these numbers yourself:
python3 .steps/06_compare.py
Output (5M noise samples):
Demo 2: float32 accumulator
sample no-cal calibrated std MA err(no-cal) err(cal)
500,000 0.9821 0.9821 0.9821 1.66e-06 0.00e+00
5,000,000 1.2276 1.2277 1.2277 6.89e-05 0.00e+00
Demo 3: double accumulator (what the C code uses)
sample recursive std MA err
500,000 0.9821 0.9821 8.88e-16
5,000,000 1.2277 1.2277 8.88e-16