# Design Rationale

This document explains *why* Auto Clip makes certain musical and engineering choices.

---

## Why “bars” instead of arbitrary seconds?

Electronic music (especially trance) is structured in **bars** and **phrases**:
- Most tracks are in **4/4**
- Changes happen on predictable boundaries (every 4/8/16/32 bars)

Cutting on bar boundaries reduces:
- awkward mid-kick edits
- off-grid transitions
- “why does this feel wrong?” moments

That’s why the CLI exposes:
- `--bars` (2 bars for rollcall, 4 bars for mini-mix feel)
- `--preroll-bars` (start a bar earlier so the listener hears the groove before the highlight)
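
The arithmetic behind bar-based cutting is small enough to sketch. The snippet below assumes a constant tempo and 4/4 time; `bar_seconds` and `snap_to_bar` are illustrative names, not functions taken from Auto Clip itself.

```python
def bar_seconds(bpm: float, beats_per_bar: int = 4) -> float:
    """Length of one bar in seconds at a constant tempo."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return beats_per_bar * 60.0 / bpm


def snap_to_bar(time_s: float, bpm: float, first_downbeat_s: float = 0.0) -> float:
    """Snap a timestamp to the nearest bar boundary on a constant-tempo grid.

    first_downbeat_s is the (assumed known) time of the first bar start.
    """
    bar = bar_seconds(bpm)
    index = round((time_s - first_downbeat_s) / bar)
    # Never snap to a bar before the first downbeat.
    return first_downbeat_s + max(index, 0) * bar


# At 120 BPM one bar lasts 2 seconds, so 61.3 s snaps to the bar at 62.0 s.
print(bar_seconds(120.0), snap_to_bar(61.3, 120.0))
```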

---

## Why “pre-roll bars”?

Highlights often occur at an impact moment:
- a stab
- a fill
- a drop hit

If you cut *exactly* at the highlight, the listener misses the *lead-in groove*.
Pre-roll gives the ear context, so the transition feels like a DJ brought it in.

Practical defaults:
- Rollcall: `--bars 2 --preroll-bars 1`
- Mini-mix: `--bars 4 --preroll-bars 1`
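
As a rough illustration of how pre-roll shifts a clip, the sketch below computes a start/end window from a highlight time. It assumes a constant tempo, 4/4 time, and that `--bars` is the total clip length with the pre-roll counted inside it; that last point is an assumption rather than documented behaviour, and `clip_window` is a hypothetical helper.

```python
def clip_window(highlight_s: float, bpm: float, bars: int,
                preroll_bars: int, beats_per_bar: int = 4) -> tuple[float, float]:
    """Return (start, end) in seconds for a clip that begins
    preroll_bars before the highlight and lasts bars in total."""
    bar = beats_per_bar * 60.0 / bpm
    # Start early so the groove is heard first; clamp at the start of the track.
    start = max(highlight_s - preroll_bars * bar, 0.0)
    return start, start + bars * bar


# Rollcall-style defaults at 120 BPM: a 2-bar clip starting 1 bar early.
print(clip_window(60.0, 120.0, bars=2, preroll_bars=1))
```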

---

## Why energy + onset for highlight detection?

In EDM, “interesting” moments correlate with:
- higher RMS energy (loudness/drive)
- strong transient activity (onset strength)

A simple weighted sum (with robust normalization) is:
- fast
- local-only
- reasonably effective across many tracks

It’s not perfect (pads/breakdowns can confuse it), but it’s a strong baseline.
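
A minimal sketch of such a score, using only the standard library. The per-frame `rms` and `onset` lists are assumed to come from an audio analysis step that is not shown, median/MAD scaling stands in for "robust normalization", and the 0.6/0.4 weights are placeholders rather than Auto Clip's real values.

```python
from statistics import median


def robust_normalize(values: list[float]) -> list[float]:
    """Center on the median and scale by the median absolute deviation,
    so a handful of extreme frames does not dominate the scale."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = mad if mad > 0 else 1.0  # avoid dividing by zero on flat input
    return [(v - med) / scale for v in values]


def highlight_scores(rms: list[float], onset: list[float],
                     w_energy: float = 0.6, w_onset: float = 0.4) -> list[float]:
    """Weighted sum of normalized energy and onset strength per frame."""
    if len(rms) != len(onset):
        raise ValueError("rms and onset must have the same number of frames")
    energy_n = robust_normalize(rms)
    onset_n = robust_normalize(onset)
    return [w_energy * e + w_onset * o for e, o in zip(energy_n, onset_n)]


def best_frame(rms: list[float], onset: list[float]) -> int:
    """Index of the frame with the highest combined score."""
    scores = highlight_scores(rms, onset)
    return max(range(len(scores)), key=scores.__getitem__)
```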

---

## Why Camelot (harmonic mixing)?

DJ transitions feel smoother when keys are compatible.
The Camelot wheel provides a practical rule-of-thumb:

- Same number, A<->B (relative major/minor)
- Same letter, number +/-1 (adjacent keys on the wheel, with 12 wrapping around to 1)

Auto Clip uses **best-effort** key detection and then maps to Camelot to:
- reduce harmonic clashes
- keep the teaser musically “coherent”

Caveats:
- Key detection can be unreliable on pad-heavy sections, noise, or breakdowns
- That’s why V3 calls it best-effort and V4 plans a confidence-based fallback
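
The two Camelot rules can be expressed as a small check. This is a sketch with a hypothetical `camelot_compatible` helper; it takes codes such as `8A` or `12B`, treats identical codes as compatible, and wraps 12 around to 1 because the wheel is circular.

```python
import re

_CAMELOT = re.compile(r"(1[0-2]|[1-9])([AB])")


def parse_camelot(code: str) -> tuple[int, str]:
    """Split a Camelot code like '8A' into (8, 'A')."""
    match = _CAMELOT.fullmatch(code.strip().upper())
    if match is None:
        raise ValueError(f"not a Camelot code: {code!r}")
    return int(match.group(1)), match.group(2)


def camelot_compatible(a: str, b: str) -> bool:
    """True if two keys follow the rule-of-thumb above."""
    num_a, letter_a = parse_camelot(a)
    num_b, letter_b = parse_camelot(b)
    if num_a == num_b:
        return True  # same key, or relative major/minor (A<->B)
    if letter_a == letter_b:
        # Adjacent numbers on the wheel, including the 12 <-> 1 wrap.
        return (num_a - num_b) % 12 in (1, 11)
    return False
```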

---

## Why “downbeat-ish” snap instead of full ML downbeat detection?

True downbeat detection often needs:
- trained ML models
- more complex pipelines
- sometimes stems / better separation

Auto Clip stays local and lightweight.
So we approximate the downbeat with:
- a beat-tracking grid
- onset accent scoring at bar starts (kick/transient emphasis)

This typically yields better bar-aligned cuts than “nearest beat” snapping, without heavy dependencies.
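
One way to read this approximation, sketched under assumptions: given beat times from a beat tracker and one onset-strength value per beat (both assumed inputs here), pick the beat offset whose every-fourth-beat accents are strongest on average, then snap to the nearest beat on that offset. The function names are hypothetical and the actual scoring in Auto Clip may differ.

```python
def downbeat_phase(beat_strengths: list[float], beats_per_bar: int = 4) -> int:
    """Return the offset (0..beats_per_bar-1) whose beats carry the
    strongest average accent, taken as the likely bar start."""
    best_phase, best_score = 0, float("-inf")
    for phase in range(beats_per_bar):
        accents = beat_strengths[phase::beats_per_bar]
        if not accents:
            continue
        score = sum(accents) / len(accents)
        if score > best_score:
            best_phase, best_score = phase, score
    return best_phase


def snap_to_downbeat(time_s: float, beat_times: list[float],
                     beat_strengths: list[float], beats_per_bar: int = 4) -> float:
    """Snap time_s to the nearest estimated bar start."""
    phase = downbeat_phase(beat_strengths, beats_per_bar)
    downbeats = beat_times[phase::beats_per_bar]
    if not downbeats:
        raise ValueError("no beats to snap to")
    return min(downbeats, key=lambda t: abs(t - time_s))
```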

---

## Why 2-pass loudnorm?

When you cut from different tracks:
- perceived loudness can jump wildly
- the teaser feels amateur even if the edits are good

FFmpeg’s `loudnorm` filter supports a 2-pass workflow (measure, then apply), which:
- improves consistency
- reduces clipping risk
- keeps the teaser “radio ready” (for a promo)

That’s why V3 uses 2-pass loudnorm per clip.
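
For orientation, the two passes can be sketched as command builders. The first pass runs `loudnorm` with `print_format=json` and discards the audio; the second feeds the measured values back in. The targets (-16 LUFS, -1.5 dBTP, LRA 11) and the helper names are illustrative assumptions, not Auto Clip's actual settings.

```python
import json


def loudnorm_pass1_args(src: str, i: float = -16.0, tp: float = -1.5,
                        lra: float = 11.0) -> list[str]:
    """Pass 1: measure only; ffmpeg prints a JSON report on stderr."""
    af = f"loudnorm=I={i}:TP={tp}:LRA={lra}:print_format=json"
    return ["ffmpeg", "-hide_banner", "-i", src, "-af", af, "-f", "null", "-"]


def parse_loudnorm_json(stderr_text: str) -> dict:
    """Extract the flat JSON object that loudnorm prints at the end of stderr."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])


def loudnorm_pass2_args(src: str, dst: str, measured: dict, i: float = -16.0,
                        tp: float = -1.5, lra: float = 11.0) -> list[str]:
    """Pass 2: apply normalization using the pass-1 measurements."""
    af = (
        f"loudnorm=I={i}:TP={tp}:LRA={lra}"
        f":measured_I={measured['input_i']}"
        f":measured_TP={measured['input_tp']}"
        f":measured_LRA={measured['input_lra']}"
        f":measured_thresh={measured['input_thresh']}"
        f":offset={measured['target_offset']}"
        ":linear=true"
    )
    return ["ffmpeg", "-hide_banner", "-y", "-i", src, "-af", af, dst]
```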

---

## Why does this repo have V_1 / V_2 / V_3?

Keeping versions side-by-side shows what each stage adds:
- V_1: minimal baseline
- V_2: practical CLI + selection features
- V_3: trance/DJ quality logic

It also makes it easy for contributors to:
- understand how the tool evolved
- debug regressions

V4 aims to unify this into a single stable CLI while retaining clarity.