Array processing example
Every object just-makeit generates can process a block of samples in one call.
This example walks through every way the CLI exposes that capability, from the
free steps() that comes with every object to --variable-output batch methods
with multiple output streams.
Along the way, each section explains who owns the memory, when it is allocated, and what the Python caller can safely do with the returned array.
Six patterns, six sections:
| # | Pattern | Output allocation | Who owns it |
|---|---|---|---|
| 1 | Auto-generated steps() | Per call (or zero if out= supplied) | Caller (numpy) |
| 2 | method scalar stub + hand-written _steps() | Per call (or zero if out= supplied) | Caller (numpy) |
| 3 | method --variable-output | Allocated at __init__, re-used | Object (zero-copy view) |
| 4 | method --variable-output --multi-output | Same, one buffer per stream | Object (tuple of views) |
| 5 | --arg-type type[] (buffer primary arg) | Caller supplies input buffer | Caller (input) |
| 6 | method --out-type (per-call typed output array) | Per call (PyArray_EMPTY) | Caller (numpy) |
All six patterns share a common rule: inline float[N] state arrays in the
C struct require no heap allocation — they are part of the struct itself.
Heap allocation only appears when the output size is not fixed at compile time.
TL;DR — see it work first
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
just-makeit example array_processing
# array_processing: PASSED
Prerequisites
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh)
Pass a custom path to keep the venv somewhere persistent:
. <(curl -fsSL https://just-buildit.github.io/just-makeit/install.sh) -- ~/my-venv
Or with pip if just-makeit is already installed:
pip install just-makeit && just-makeit install-deps
With the installer's default location, activate the venv in each new shell with:
source /tmp/jm-venv/bin/activate
1. Auto-generated steps() — free with every object
just-makeit new my_arrays \
--object ema \
--arg-type float \
--return-type float \
--state alpha:float:0.1f \
--state prev:float:0.0f
cd my_arrays
Every just-makeit object generates both step() and steps():
| C function | Signature |
|---|---|
| ema_step | float ema_step(ema_state_t *s, float x) |
| ema_steps | void ema_steps(ema_state_t *s, const float *in, float *out, size_t n) |
steps() is a thin loop in native/src/ema/ema_core.c — it calls step()
once per sample. You implement step(); steps() comes for free.
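The step()/steps() relationship can be modeled in a few lines of Python. This is a sketch only. The EMA update rule y = prev + alpha * (x - prev) is an assumption, because the generated step() is yours to implement. The class and function names are illustrative, not generated code.

```python
from dataclasses import dataclass


@dataclass
class EmaState:
    """Mirrors the two --state fields from the command above."""
    alpha: float = 0.1
    prev: float = 0.0


def ema_step(s: EmaState, x: float) -> float:
    # Assumed EMA update; the real body of ema_step() is yours to write.
    s.prev = s.prev + s.alpha * (x - s.prev)
    return s.prev


def ema_steps(s: EmaState, inp, out, n: int) -> None:
    # The "thin loop": one step() per sample, writing into a caller-owned buffer.
    for i in range(n):
        out[i] = ema_step(s, inp[i])
```

Because steps() threads the same state through every sample, one steps() call over a block gives exactly the results of calling step() sample by sample.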
What Python sees
import numpy as np
from my_arrays import Ema
f = Ema(alpha=0.1)
block = np.random.randn(1024).astype(np.float32)
out = f.steps(block) # returns np.ndarray, shape (1024,), dtype float32
steps() allocates a fresh numpy array on every call (PyArray_SimpleNew) and
returns it. The caller owns that array outright — the object holds no reference
to it and never touches it again.
The C API — caller-supplied pointers, no allocation
At the C level, steps() takes both pointers from the caller and allocates
nothing:
/* Output buffer must be pre-allocated by caller. */
void ema_steps(ema_state_t *state,
const float *input,
float *output,
size_t n);
This is true with or without --perf: JM_DEFINE_STEPS only replaces the
loop body (adding SIMD dispatch), not the signature or the allocation model.
The Python ext — one malloc per call
The ext is the only place an allocation happens. It calls PyArray_SimpleNew
to create the output array, passes the raw pointer to ema_steps, then
returns the numpy array to the caller:
call f.steps(block)
│
├─ ext calls PyArray_SimpleNew(n) ← one malloc, every call
│
├─ calls ema_steps(state, block.data, out.data, 1024)
│ └─ no allocation inside; fills out[] in place
│
└─ returns ndarray to caller
ownership: caller
lifetime: indefinite — safe to hold, copy, or discard at will
Successive calls are independent: the previous result is never overwritten.
This is the opposite of --variable-output (§3), where the object owns a
fixed buffer and reuses it each call.
Eliminating the per-call malloc with out=
Pass a pre-allocated numpy array as the second argument and the ext writes
directly into it — PyArray_SimpleNew is skipped entirely:
buf = np.empty(1024, dtype=np.float32) # allocate once
for block in stream:
f.steps(block, buf) # zero allocation on the hot path
The return value is the same array you passed in (f.steps(block, buf) is buf), so you
can ignore it or use it for chaining. The buffer must be
C-contiguous, the correct dtype, and at least as long as the input.
call f.steps(block, buf)
│
├─ ext validates buf: dtype, C-contiguous, len == n
│
├─ calls ema_steps(state, block.data, buf.data, 1024)
│ └─ no allocation; fills buf in place
│
└─ returns buf (same object, new reference)
ownership: caller retains
lifetime: safe to reuse immediately on next call
This is the right choice for any processing loop where throughput matters.
For one-shot calls or exploratory work the default (no out=) is simpler.
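The two allocation modes can be sketched in plain Python. In this sketch, array.array stands in for the numpy array, and step_fn stands in for the C step() function. The strict length check follows the validation step in the diagram above; the real ext's exact rules may differ. This is a hypothetical model, not the generated wrapper.

```python
from array import array


def steps(step_fn, block, out=None):
    """Model of the ext's steps() wrapper (hypothetical, not generated code).

    Without out=, a fresh float32 array is allocated on every call, like the
    PyArray_SimpleNew path. With out=, the buffer is validated and filled in
    place, and the same object is returned.
    """
    n = len(block)
    if out is None:
        out = array("f", [0.0]) * n  # one allocation per call; caller owns it
    elif not isinstance(out, array) or out.typecode != "f" or len(out) != n:
        # array.array is always contiguous, so only type and length are checked
        raise ValueError("out must be a float32 array of length len(block)")
    for i, x in enumerate(block):
        out[i] = step_fn(x)
    return out
```

In a processing loop you would create buf once, then call steps(step_fn, block, buf) for every block. Only the first line allocates.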
Inline array state — no heap per field
If your object has fixed-length array state (e.g. --state "coeffs:float[16]"),
those arrays live inside the C struct, not on the heap:
typedef struct {
float coeffs[16]; /* inline — no extra malloc */
float delay[16]; /* inline */
float gain;
} ema_state_t;
ema_create() does exactly one malloc for the whole struct. There is no
malloc per field, no pointer to chase, and no fragmentation.
Contrast this with a hypothetical float *coeffs pointer: that would require
a separate allocation, a separate free, and careful ownership accounting.
just-makeit avoids this entirely by embedding arrays inline whenever the length
is fixed at code-generation time.
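ctypes lays out structures according to the platform C ABI, so it can make the inline-versus-pointer difference concrete. EmaStateInline copies the hypothetical struct above. EmaStatePointer is an illustrative counter-example, not something just-makeit generates.

```python
import ctypes


class EmaStateInline(ctypes.Structure):
    # Inline arrays: all 33 floats are part of the struct's own storage,
    # so one allocation of sizeof(struct) covers every field.
    _fields_ = [
        ("coeffs", ctypes.c_float * 16),
        ("delay", ctypes.c_float * 16),
        ("gain", ctypes.c_float),
    ]


class EmaStatePointer(ctypes.Structure):
    # Hypothetical pointer layout: the struct holds only an address, so the
    # coefficients would need a second allocation (and a matching free).
    _fields_ = [
        ("coeffs", ctypes.POINTER(ctypes.c_float)),
        ("gain", ctypes.c_float),
    ]
```

On a typical 64-bit build, EmaStatePointer is only 16 bytes, and its coeffs starts out NULL. The coefficient storage would have to live in a separate heap block.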
2. method — scalar stub + hand-written _steps()
Use just-makeit method when you need an execute path with different
input or output types than the primary step().
# Add a second execute method with a different I/O type.
# This object produces uint32 phase words in addition to float output.
just-makeit method ema quantize \
--arg-type float \
--return-type uint32_t
The command appends a scalar C stub to native/src/ema/ema_core.c:
uint32_t ema_quantize(ema_state_t *state, float x);
For 1:1-rate batch work (output count equals input count), write the
_steps() companion by hand in the same file:
/* Hand-written batch companion for ema_quantize().
* Add this to native/src/ema/ema_core.c after implementing the scalar stub.
* The Python ext allocates out[] via PyArray_SimpleNew before calling this;
* the Python caller only passes the input array.
* This is the right pattern when output count == input count (1:1 rate).
*/
void
ema_quantize_steps(ema_state_t *state,
const float *in,
uint32_t *out,
size_t n)
{
for (size_t i = 0; i < n; i++)
out[i] = ema_quantize(state, in[i]);
}
Then wire it into native/src/ema/ema_ext.c following the ema_steps
pattern already there.
Array ownership for hand-written _steps()
The Python caller's experience is identical to the auto-generated steps():
pass one input array, get back a new numpy array.
call f.quantize_steps(block)
│
├─ ext calls PyArray_SimpleNew(n, uint32) ← one malloc, every call
│
├─ calls ema_quantize_steps(state, block.data, out.data, n)
│ └─ loop: out[i] = ema_quantize(state, block[i])
│
└─ returns ndarray to caller
ownership: caller
lifetime: indefinite — object holds no reference to it
The C function ema_quantize_steps takes both pointers, but the ext performs
the allocation and hands the resulting array to the caller; the Python caller never passes or manages an output buffer.
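Here is a Python model of the 1:1 pattern with a typed output. The wrapper allocates the output at the input's length, but with its own element type. The quantize() mapping below is purely hypothetical, since the real ema_quantize() is yours to write. Array typecode "I" stands in for a uint32 numpy array.

```python
from array import array

UINT32_MAX = 2**32 - 1


def quantize(x: float) -> int:
    # Hypothetical scalar body: map [0.0, 1.0) onto the full uint32 range,
    # saturating outside it.
    v = int(x * 2**32)
    return min(max(v, 0), UINT32_MAX)


def quantize_steps(block):
    # Model of the ext: allocate a uint32 output with the SAME length as the
    # input (1:1 rate), then run the scalar function once per sample.
    # Typecode "I" is 4 bytes on mainstream platforms.
    out = array("I", [0]) * len(block)
    for i, x in enumerate(block):
        out[i] = quantize(x)
    return out  # fresh array, owned by the caller
```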
When to use this pattern
- You need a different input or output type than the primary step().
- Output count equals input count (1:1 rate).
- Straightforward; no infrastructure beyond the loop.
When not to use it
If output count differs from input count but its maximum is bounded by object
state and knowable at init time (e.g. a decimator), use --variable-output
instead. It allocates the output buffer once at __init__ and removes the per-call allocation entirely. See §3.
3. method --variable-output — pre-allocated, zero-copy batch
Use this when the maximum output count is bounded by state and knowable at
init time. The classic case is a rate-changing block: a 2× decimator with
block size B can produce at most ceil(B / 2) outputs per call.
# A half-band decimator: input block of N complex samples, output ≤ N/2 samples.
# Because the maximum output is known at init time (ceil(block_size / 2)),
# --variable-output pre-allocates the output buffer once and returns a view.
cd ..
just-makeit new my_decim \
--object hbdecim \
--arg-type "float _Complex" \
--return-type "float _Complex" \
--state "delay:float _Complex[12]"
cd my_decim
just-makeit method hbdecim execute \
--arg-type "float _Complex" \
--return-type "float _Complex" \
--variable-output
The command appends two C stubs to native/src/hbdecim/hbdecim_core.c:
| Stub | When called | Your job |
|---|---|---|
| hbdecim_execute_max_out(state) | Once at Python __init__ | Return the output bound |
| hbdecim_execute(state, in, n_in, out) | Every Python call | Fill out, return actual count |
Implement both:
/* Implement in native/src/hbdecim/hbdecim_core.c.
*
* The Python ext calls this once at __init__ to size the pre-allocated
* output buffer. Return the largest n_out that execute() can ever produce
* for any valid call. Here: block_size / 2, rounded up.
*
* Must be positive. Returning 0 causes malloc(0), which is implementation-
* defined and will likely produce a silent bug.
*/
size_t
hbdecim_execute_max_out(hbdecim_state_t *state)
{
/* state->block_size is a constructor parameter (add with just-makeit add) */
return (state->block_size + 1) / 2;
}
/* Process n_in samples; write the decimated samples to out[] and return n_out,
* the number written. n_out must never exceed hbdecim_execute_max_out().
* The caller (Python ext) supplies the pre-allocated output buffer.
*/
size_t
hbdecim_execute(hbdecim_state_t *state,
const float complex *in,
size_t n_in,
float complex *out)
{
size_t n_out = 0;
for (size_t i = 0; i + 1 < n_in; i += 2) {
/* TODO: polyphase half-band implementation */
out[n_out++] = (in[i] + in[i + 1]) * 0.5f;
}
return n_out;
}
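A quick Python check can confirm the contract between the two stubs. For every input length up to block_size, execute() never writes more than execute_max_out() samples. This is a model of the C above, not generated code. Python complex numbers stand in for float complex, and a list stands in for the pre-allocated buffer.

```python
def execute_max_out(block_size: int) -> int:
    # Same integer expression as the C stub: ceil(block_size / 2).
    return (block_size + 1) // 2


def execute(inp, out) -> int:
    # Same averaging placeholder as the C stub: one output per input pair.
    n_out = 0
    i = 0
    while i + 1 < len(inp):
        out[n_out] = (inp[i] + inp[i + 1]) * 0.5
        n_out += 1
        i += 2
    return n_out
```

The bound holds only while n_in stays at or below block_size. A longer block would make execute() write past the buffer, so keep calls within block_size. This page does not say whether the generated ext rejects oversized input.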
What Python sees
import numpy as np
from my_decim import Hbdecim
d = Hbdecim() # __init__ calls execute_max_out(); mallocs output buffer once
block = (np.random.randn(1024) + 1j * np.random.randn(1024)).astype(np.complex64)
view = d.execute(block) # returns zero-copy view; shape (≤512,)
d.execute(block) returns a numpy view into the object's internal output
buffer. No sample buffer is allocated on this call path; only the lightweight view object itself is created.
Array ownership for --variable-output
d = Hbdecim()
│
└─ ext calls hbdecim_execute_max_out() → 512
ext mallocs float complex[512] ← one malloc, at __init__
stored as d._out_buf (opaque)
view = d.execute(block)
│
├─ calls hbdecim_execute(state, block.data, 1024, d._out_buf) → returns 512
│
└─ returns numpy view wrapping d._out_buf[:512]
ownership: object retains the buffer
lifetime: view is valid until the NEXT call to d.execute()
— do not hold the view across calls; copy if you need to keep it
# Safe: process, then copy if needed
view = d.execute(block)
keep = view.copy() # independent array, survives next call
Critical constraint: the view becomes stale on the next execute() call
because the object overwrites the same buffer. Copy before calling again if
you need to retain more than one block.
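The stale-view rule follows directly from sharing one buffer. Below is a stdlib-only Python model, not the generated ext. memoryview plays the role of the numpy view, tolist() plays the role of view.copy(), and real floats replace complex64 for brevity.

```python
from array import array


class Decimator:
    """Model of --variable-output: one buffer allocated at __init__, reused."""

    def __init__(self, block_size: int = 1024):
        self._max_out = (block_size + 1) // 2
        # The single allocation; float64 stands in for complex64 here.
        self._out_buf = array("d", [0.0]) * self._max_out

    def execute(self, block) -> memoryview:
        n_out = 0
        for i in range(0, len(block) - 1, 2):
            self._out_buf[n_out] = (block[i] + block[i + 1]) * 0.5
            n_out += 1
        # Zero-copy: a view of the first n_out elements of the object's buffer.
        return memoryview(self._out_buf)[:n_out]
```

After a second call, the first view silently shows the new data, while the copied list keeps the old values.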
When to use --variable-output
| Use case | _max_out returns | Appropriate? |
|---|---|---|
| Decimator, ratio R, block size B | ceil(B / R) | Yes |
| FIFO with fixed capacity C | C | Yes |
| FIR filter, 1:1 rate | unknown at init | No: output size = input size; use auto steps() |
| Integrator / accumulator | 1 per sample | No: use scalar step() |
| Overflow detector, 1:1 rate | unknown at init | No: use scalar method + hand-written _steps() |
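For the decimator row, ceil(B / R) is best computed in integer arithmetic. This is the same approach the hbdecim stub uses for (block_size + 1) / 2. Below is a Python sketch of the general formula; the function name is illustrative.

```python
def decimator_max_out(block_size: int, ratio: int) -> int:
    """Upper bound on outputs per call for a decimator: ceil(B / R).

    Integer-only, so it cannot suffer float rounding for large B, and it
    is always positive for positive inputs (a _max_out must never be 0).
    """
    if block_size <= 0 or ratio <= 0:
        raise ValueError("block_size and ratio must be positive")
    return (block_size + ratio - 1) // ratio
```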
4. method --variable-output --multi-output — parallel output streams
--multi-output TYPE adds a second pre-allocated output buffer alongside the
primary one. The Python call returns a tuple. The flag is repeatable for
three or more streams.
# Two parallel output streams from one call:
# primary: float _Complex (filtered samples)
# secondary: uint8_t (per-sample overflow flag)
just-makeit method hbdecim execute_ovf \
--arg-type "float _Complex" \
--return-type "float _Complex" \
--variable-output \
--multi-output uint8_t
Generated stubs appended to hbdecim_core.c:
size_t hbdecim_execute_ovf_max_out(hbdecim_state_t *state);
size_t hbdecim_execute_ovf(hbdecim_state_t *state,
const float complex *in, size_t n_in,
float complex *out,
uint8_t *ovf);
Both out and ovf are pre-allocated to _max_out() elements and owned by
the object. Your implementation fills both and returns the count:
/* Implement in native/src/hbdecim/hbdecim_core.c.
*
* Two output arrays: primary (filtered samples) and secondary (overflow flags).
* Both are pre-allocated by the ext to execute_ovf_max_out() elements.
* Return the actual count written to both arrays.
*/
size_t
hbdecim_execute_ovf_max_out(hbdecim_state_t *state)
{
return (state->block_size + 1) / 2;
}
size_t
hbdecim_execute_ovf(hbdecim_state_t *state,
const float complex *in,
size_t n_in,
float complex *out, /* primary */
uint8_t *ovf) /* secondary */
{
size_t n_out = 0;
for (size_t i = 0; i + 1 < n_in; i += 2) {
float complex y = (in[i] + in[i + 1]) * 0.5f;
out[n_out] = y;
ovf[n_out] = (cabsf(y) > 1.0f) ? 1 : 0;
n_out++;
}
return n_out;
}
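Both streams share one index, n_out, so the count the function returns is valid for both buffers. Below is a Python model of the C above, not generated code. A list stands in for the complex buffer, and a bytearray stands in for the uint8_t buffer.

```python
def execute_ovf(inp, out, ovf) -> int:
    """Model of hbdecim_execute_ovf: fill two parallel buffers, return one count.

    out: pre-allocated list of complex (primary stream)
    ovf: pre-allocated bytearray (secondary uint8_t stream)
    """
    n_out = 0
    for i in range(0, len(inp) - 1, 2):
        y = (inp[i] + inp[i + 1]) * 0.5
        out[n_out] = y
        # abs() of a Python complex is its magnitude, like cabsf() in C.
        ovf[n_out] = 1 if abs(y) > 1.0 else 0
        n_out += 1
    return n_out
```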
What Python sees
import numpy as np
from my_decim import Hbdecim
d = Hbdecim()
block = (np.random.randn(1024) + 1j * np.random.randn(1024)).astype(np.complex64)
samples, flags = d.execute_ovf(block) # tuple of two zero-copy views
Array ownership for multi-output
d = Hbdecim()
│
└─ ext mallocs float complex[512] → d._out_buf
ext mallocs uint8_t[512] → d._ovf_buf
both stored in the object
samples, flags = d.execute_ovf(block)
│
├─ calls hbdecim_execute_ovf(..., d._out_buf, d._ovf_buf) → returns 512
│
├─ returns (view into d._out_buf[:512],
│ view into d._ovf_buf[:512])
│
│ ownership: object retains both buffers
│ lifetime: both views stale after next call to execute_ovf()
│ — copy before calling again
n_ovf = int(flags.sum()) # safe — flags is still valid here
samples_copy = samples.copy() # independent; survives next call
The same "stale after next call" rule applies to every buffer produced by
--variable-output. The zero-copy design makes the steady-state path
allocation-free; the copy obligation is the trade-off.
5. --arg-type type[] — array-buffer primary arg
Some objects are designed to consume an entire buffer in one call — a
decimator, a packet framer, a block codec. Wrapping them with a scalar
step() + auto-generated steps() adds indirection that compilers cannot
always eliminate. Append [] to the arg type to express this directly.
Omitting --return-type defaults to void — the natural choice for a
pure consumer. Add --return-type T to get a scalar back (e.g. a packet
framer returning the number of packets emitted).
Void-return (default):
just-makeit new my_sink \
--object buf_sink \
--arg-type "float _Complex[]" \
--state "count:int32_t:0"
Scalar-return (explicit):
just-makeit new my_buf \
--object buf_proc \
--arg-type "float _Complex[]" \
--return-type int \
--state "count:int32_t:0"
The generated step() takes a pointer and length:
/* void-return variant */
void buf_sink_step(buf_sink_state_t *state,
const float complex *x, size_t x_len)
{
(void)state; (void)x; (void)x_len; /* TODO: implement */
}
/* scalar-return variant */
int buf_proc_step(buf_proc_state_t *state,
const float complex *x, size_t x_len)
{
(void)x; (void)x_len;
return 0; /* TODO: implement */
}
steps() is not generated — the primary operation already takes a buffer.
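Here is a Python model of what a buf_proc_step() body might do, to make "returns the number of packets emitted" concrete. Everything about the framing is an assumption for illustration, including PACKET_LEN and using count as a carry-over. Because the step sees the whole block, it can work on the block length directly instead of looping sample by sample.

```python
PACKET_LEN = 100  # hypothetical fixed packet length, in samples


class BufProcModel:
    """Model of a buffer-primary step(): consumes a whole block per call.

    'count' (the int32 state field from the command above) is reused here
    to hold samples carried over from the previous block; step() returns
    how many complete packets this block finished. Hypothetical logic.
    """

    def __init__(self, count: int = 0):
        self.count = count

    def step(self, x) -> int:
        total = self.count + len(x)
        packets = total // PACKET_LEN
        self.count = total % PACKET_LEN
        return packets
```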
What Python sees
import numpy as np
from my_sink import BufSink
from my_buf import BufProc
block = (np.random.randn(1024) + 1j * np.random.randn(1024)).astype(np.complex64)
sink = BufSink()
sink.step(block) # void — consumes the buffer
proc = BufProc()
n = proc.step(block) # returns int — e.g. packets emitted
Type stub (my_buf/src/my_buf/buf_proc.pyi)
import numpy as np
from numpy.typing import NDArray

class BufProc:
def __init__(self, count: np.int32 = 0) -> None: ...
def step(self, x: NDArray[np.complex64]) -> int:
"""Process one sample."""
# no steps() — the primary op already takes a buffer
6. method --out-type — per-call typed output array
Use this when your method takes an array input via --param and needs to
return an output array of a different type — and the output length is
simply input_length / divisor (known per call, not at init).
The difference from --variable-output:
| | --variable-output | --out-type |
|---|---|---|
| Buffer lifetime | Pre-allocated at __init__; zero-copy view | Allocated fresh each call |
| Output type | Same as --return-type | Independent --out-type |
| Output length | _max_out(state), set at init | input_len / --out-divisor |
Classic use case: a type converter that takes a CI8 byte buffer and returns
CF32 samples — output length is input_bytes / 2.
cd ..
just-makeit new my_conv \
--object ci8_conv \
--state "gain:float:1.0"
cd my_conv
just-makeit method ci8_conv convert \
--param raw:int8_t[] \
--out-type "float _Complex" \
--out-divisor 2 \
--return-type void
--out-divisor 2 means: output length = raw_len / 2 (each complex sample
is two bytes: one I + one Q).
Generated C stub appended to ci8_conv_core.c:
void
ci8_conv_convert(ci8_conv_state_t *state,
const int8_t *raw, size_t raw_len,
float complex *out)
{
(void)state; (void)raw; (void)raw_len; (void)out;
}
Your implementation:
void
ci8_conv_convert(ci8_conv_state_t *state,
const int8_t *raw, size_t raw_len,
float complex *out)
{
size_t n = raw_len / 2;
float scale = state->gain / 128.0f;
for (size_t i = 0; i < n; i++)
out[i] = ((float)raw[2*i] + (float)raw[2*i+1] * I) * scale; /* I: imaginary unit from <complex.h> */
}
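The conversion is easy to check against a pure-Python model. In this sketch, array("b") stands in for the int8 numpy input and a list stands in for the complex64 output; the function name is illustrative. The model also shows what happens to an odd trailing byte: the integer division drops it, just as the C loop does.

```python
from array import array


def ci8_to_cf32(raw, gain: float = 1.0):
    """Python model of ci8_conv_convert(): interleaved int8 I/Q -> complex.

    Output length is len(raw) // 2 (the --out-divisor); an odd trailing
    byte is ignored, as the integer division in the C loop does.
    """
    n = len(raw) // 2
    scale = gain / 128.0
    out = [0j] * n  # fresh output each call, like PyArray_EMPTY
    for i in range(n):
        out[i] = complex(raw[2 * i], raw[2 * i + 1]) * scale
    return out
```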
What Python sees
import numpy as np
from my_conv import Ci8Conv
conv = Ci8Conv(gain=1.0)
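iq_bytes = np.random.randint(-128, 128, 2048, dtype=np.int8).tobytes() # stand-in for a captured CI8 buffer
raw = np.frombuffer(iq_bytes, dtype=np.int8)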
cf32 = conv.convert(raw) # returns np.ndarray, dtype=complex64, shape (len(raw)//2,)
Array ownership for --out-type
cf32 = conv.convert(raw)
│
├─ ext computes out_len = raw_len / 2
│
├─ ext calls PyArray_EMPTY(out_len, complex64) ← one malloc, every call
│
├─ calls ci8_conv_convert(state, raw.data, raw_len, out.data)
│ └─ no allocation; fills out[] in place
│
└─ returns ndarray to caller
ownership: caller
lifetime: indefinite — object holds no reference to it
Ownership is identical to patterns 1 and 2: a fresh array each call, owned by the caller, safe to hold indefinitely.
Choosing between the six patterns
Does output count equal input count?
├─ Yes, and input is one sample → use step() + auto steps() (§1)
│
├─ Yes, but a method has a different return type → use just-makeit method (§2)
│
├─ No → is the maximum output count knowable at init time?
│ ├─ Yes, one stream → --variable-output (§3)
│ └─ Yes, N streams → --variable-output --multi-output (§4)
│
├─ Primary op takes a whole buffer → --arg-type type[] (§5)
│ (no steps() generated; step() accepts NDArray directly)
│
└─ Array input, typed output, length = input_len / N → --out-type (§6)
(output array allocated per call; different type from input)