Design Rationale
This document explains why Auto Clip makes certain musical and engineering choices.
Why “bars” instead of arbitrary seconds?
Electronic music (especially trance) is structured in bars and phrases:
- Most tracks are in 4/4
- Changes happen on predictable boundaries (every 4/8/16/32 bars)
Cutting on bar boundaries reduces:
- awkward mid-kick edits
- off-grid transitions
- “why does this feel wrong?” moments
That’s why the CLI exposes:
- --bars (2 bars for rollcall, 4 bars for mini-mix feel)
- --preroll-bars (start a bar earlier so the listener hears the groove before the highlight)
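As a rough illustration of the bar arithmetic, the sketch below converts a tempo into a bar length and snaps a timestamp to the nearest bar boundary. It assumes a constant tempo, 4/4 time, and a known first downbeat; the function names are hypothetical and are not taken from the Auto Clip code.

```python
def bar_seconds(bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of one bar in seconds at a constant tempo."""
    return beats_per_bar * 60.0 / bpm


def snap_to_bar(t: float, bpm: float, first_downbeat: float = 0.0,
                beats_per_bar: int = 4) -> float:
    """Move time t (seconds) to the nearest bar boundary on a fixed grid."""
    bar = bar_seconds(bpm, beats_per_bar)
    # Index of the nearest bar line, never earlier than the first downbeat.
    index = max(round((t - first_downbeat) / bar), 0)
    return first_downbeat + index * bar


if __name__ == "__main__":
    print(bar_seconds(120))        # 2.0 seconds per bar at 120 BPM
    print(snap_to_bar(61.3, 120))  # 62.0
```

Real tracks drift or change tempo, so a fixed grid like this is only a starting point; a beat tracker would supply the actual bar lines.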
Why “pre-roll bars”?
Highlights often occur at an impact moment:
- a stab
- a fill
- a drop hit
If you cut exactly at the highlight, the listener misses the lead-in groove. Pre-roll gives the ear context, so the transition feels like a DJ brought it in.
Practical defaults:
- Rollcall: --bars 2 --preroll-bars 1
- Mini-mix: --bars 4 --preroll-bars 1
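One way to picture how these two values combine is the sketch below. It assumes the highlight time already sits on a bar boundary, that the clip length given by the bar count includes the pre-roll, and that the tempo is constant in 4/4. The document does not specify these details, so treat them (and the function name) as illustrative assumptions.

```python
def clip_window(highlight_s: float, bpm: float, bars: int,
                preroll_bars: int, beats_per_bar: int = 4) -> tuple:
    """Return (start, end) in seconds for a clip that begins
    preroll_bars before the highlight and lasts bars in total."""
    bar = beats_per_bar * 60.0 / bpm
    # Start early so the groove is heard first; clamp at the track start.
    start = max(highlight_s - preroll_bars * bar, 0.0)
    end = start + bars * bar
    return start, end


if __name__ == "__main__":
    # Rollcall-style values at 120 BPM: 2 bars total, 1 bar of pre-roll.
    print(clip_window(64.0, 120, bars=2, preroll_bars=1))  # (62.0, 66.0)
```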
Why energy + onset for highlight detection?
In EDM, “interesting” moments correlate with:
- higher RMS energy (loudness/drive)
- strong transient activity (onset strength)
A simple weighted sum of the two signals (with robust normalization) is:
- fast to compute
- local-only (no external services)
- reasonably accurate across many tracks
It’s not perfect (pads/breakdowns can confuse it), but it’s a strong baseline.
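A minimal sketch of such a score follows. It assumes per-frame RMS and onset-strength values are already available as plain lists, uses median/MAD scaling as one possible form of "robust normalization", and picks equal weights arbitrarily. None of these choices are confirmed as Auto Clip's actual settings.

```python
from statistics import median


def robust_normalize(values):
    """Center on the median and scale by the median absolute deviation,
    so a few extreme frames do not dominate the scale."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = mad if mad > 0 else 1.0  # avoid division by zero on flat input
    return [(v - med) / scale for v in values]


def highlight_scores(rms, onset, w_rms=0.5, w_onset=0.5):
    """Weighted sum of normalized energy and onset strength per frame."""
    if len(rms) != len(onset):
        raise ValueError("rms and onset must have the same length")
    r, o = robust_normalize(rms), robust_normalize(onset)
    return [w_rms * a + w_onset * b for a, b in zip(r, o)]


def best_frame(rms, onset):
    """Index of the frame with the highest combined score."""
    scores = highlight_scores(rms, onset)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rms = [0.1, 0.1, 0.2, 0.9, 0.2]
    onset = [0.0, 0.1, 0.1, 0.8, 0.1]
    print(best_frame(rms, onset))  # 3
```

A long, loud pad section would still score highly on the RMS term, which is the kind of failure the caveat above describes.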
Why Camelot (harmonic mixing)?
DJ transitions feel smoother when keys are compatible. The Camelot wheel provides a practical rule-of-thumb:
- Same number A<->B (relative major/minor)
- Same letter, number +/-1, wrapping between 12 and 1 (adjacent keys, a perfect fifth apart)
Auto Clip uses best-effort key detection and then maps to Camelot to:
- reduce harmonic clashes
- keep the teaser musically “coherent”
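The two rules above can be sketched as a small compatibility check. It assumes keys are already expressed as Camelot codes such as "8A" or "12B"; the function names are hypothetical, and the mapping from detected key to Camelot code is left out.

```python
def parse_camelot(code: str) -> tuple:
    """Split a Camelot code like '8A' into (8, 'A')."""
    number, letter = int(code[:-1]), code[-1].upper()
    if not 1 <= number <= 12 or letter not in ("A", "B"):
        raise ValueError(f"not a Camelot code: {code!r}")
    return number, letter


def camelot_compatible(a: str, b: str) -> bool:
    """True if two Camelot codes are the same key, relative
    major/minor, or neighbours on the same ring of the wheel."""
    na, la = parse_camelot(a)
    nb, lb = parse_camelot(b)
    if na == nb:
        return True  # same key, or same number A<->B
    if la == lb:
        # Same letter, number +/-1, with 12 wrapping around to 1.
        return (na - nb) % 12 in (1, 11)
    return False


if __name__ == "__main__":
    print(camelot_compatible("8A", "8B"))   # True
    print(camelot_compatible("12B", "1B"))  # True
    print(camelot_compatible("8A", "10A"))  # False
```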
Caveats:
- Key detection can be unreliable on pad-heavy sections, noise, or breakdowns
- That’s why V3 calls it best-effort and V4 plans a confidence-based fallback
Why “downbeat-ish” snap instead of full ML downbeat detection?
True downbeat detection often needs:
- trained ML models
- more complex pipelines
- sometimes stems / better separation
Auto Clip stays local and lightweight, so it approximates the downbeat by combining:
- a beat-tracking grid
- onset accent scoring at bar starts (kick/transient emphasis)
This typically yields better bar-aligned cuts than snapping to the nearest beat, without heavy dependencies.
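The idea can be sketched as choosing which beat of the 4/4 grid is "beat one": try each possible offset and keep the one whose beats carry the strongest average onset accent. The sketch assumes beat times and an onset strength sampled at each beat are already available from a beat tracker; the function names are hypothetical, and the real scoring may differ.

```python
def pick_downbeat_phase(onset_at_beats, beats_per_bar=4):
    """Return the beat offset (0..beats_per_bar-1) whose every-Nth
    beats have the highest mean onset strength."""
    best_phase, best_score = 0, float("-inf")
    for phase in range(beats_per_bar):
        accents = onset_at_beats[phase::beats_per_bar]
        if not accents:
            continue  # fewer beats than one bar: nothing to score
        score = sum(accents) / len(accents)
        if score > best_score:
            best_phase, best_score = phase, score
    return best_phase


def bar_starts(beat_times, onset_at_beats, beats_per_bar=4):
    """Beat times that fall on the estimated downbeats."""
    phase = pick_downbeat_phase(onset_at_beats, beats_per_bar)
    return beat_times[phase::beats_per_bar]


if __name__ == "__main__":
    beat_times = [0.5 * i for i in range(12)]         # 120 BPM grid
    onsets = [0.2, 0.3, 0.9, 0.2] * 3                 # accent on every 3rd slot
    print(bar_starts(beat_times, onsets))             # [1.0, 3.0, 5.0]
```

This is a heuristic: a track with a strong off-beat bass or a snare louder than the kick can pull the estimate onto the wrong beat.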
Why 2-pass loudnorm?
When you cut from different tracks:
- perceived loudness can jump wildly
- the teaser feels amateur even if the edits are good
FFmpeg’s loudnorm filter supports a 2-pass workflow (measure first, then apply using the measured values), which:
- improves consistency
- reduces clipping risk
- keeps the teaser “radio ready” (for a promo)
That’s why V3 uses 2-pass loudnorm per clip.
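The two passes can be sketched as follows. The filter options (I, TP, LRA, print_format, measured_I, measured_TP, measured_LRA, measured_thresh, offset, linear) and the JSON keys are those of FFmpeg's loudnorm filter; the target values, helper names, and file paths are illustrative assumptions, not Auto Clip's actual settings. The sketch only builds the commands and parses the measurement; it does not run FFmpeg.

```python
import json

# Illustrative targets (assumed): integrated loudness, true peak, loudness range.
TARGET = "I=-16:TP=-1.5:LRA=11"


def parse_measurement(stderr_text: str) -> dict:
    """Extract the JSON block that loudnorm prints at the end of pass 1."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])


def pass1_command(src: str) -> list:
    """Measure only: decode the clip and discard the output."""
    af = f"loudnorm={TARGET}:print_format=json"
    return ["ffmpeg", "-hide_banner", "-i", src, "-af", af, "-f", "null", "-"]


def pass2_command(src: str, dst: str, m: dict) -> list:
    """Apply normalization using the values measured in pass 1."""
    af = (
        f"loudnorm={TARGET}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}:linear=true"
    )
    return ["ffmpeg", "-y", "-i", src, "-af", af, dst]
```

In use, the first command would be run with its stderr captured (for example via subprocess.run), the result passed through parse_measurement, and the second command run once per clip.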
Why does this repo have V_1 / V_2 / V_3?
Keeping the versions side by side shows what each stage adds:
- V_1: minimal baseline
- V_2: practical CLI + selection features
- V_3: trance/DJ quality logic
It also makes it easy for contributors to:
- understand evolution
- debug regressions
V4 aims to unify this into a single stable CLI while retaining clarity.