These are working notes, not a tutorial. They're the things that actually changed how I build after EmotionFlow and a stack of smaller experiments. If you're starting on multimodal work, maybe they save you a week.
Fusion is a design space, not a step
'Fuse the modalities' sounds like one box in a diagram. It's actually a whole set of choices: do you fuse early (combine raw features) or late (combine predictions)? Do you weight by confidence? Do you let one modality gate another? Each choice has a different failure personality. Treat fusion as the main design problem, not the connective tissue.
- Early fusion: rich, but one bad stream contaminates everything.
- Late fusion: robust, but you lose cross-modal interactions.
- Confidence-weighted: my default — cheap and surprisingly strong.
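As an illustration of the confidence-weighted option, here is a minimal sketch of late fusion in which each modality's class probabilities are averaged with weights proportional to that modality's confidence. The function name, the label names, and the example numbers are all hypothetical; this shows the general idea, not the EmotionFlow implementation.

```python
from typing import Dict, List


def fuse_confidence_weighted(
    predictions: List[Dict[str, float]],
    confidences: List[float],
) -> Dict[str, float]:
    """Average per-modality class probabilities, weighted by confidence.

    predictions: one {label: probability} dict per modality.
    confidences: one non-negative weight per modality (assumed calibrated).
    """
    if len(predictions) != len(confidences):
        raise ValueError("need exactly one confidence per modality")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one modality must have positive confidence")

    # Collect every label seen by any modality; a missing label counts as 0.
    labels = set().union(*predictions)
    return {
        label: sum(c * p.get(label, 0.0) for p, c in zip(predictions, confidences)) / total
        for label in labels
    }


# Hypothetical example: vision is confident, audio is not.
vision = {"happy": 0.7, "sad": 0.3}
audio = {"happy": 0.2, "sad": 0.8}
fused = fuse_confidence_weighted([vision, audio], [0.9, 0.3])
```

With these made-up numbers the low-confidence audio stream still contributes, but it cannot overturn the vision prediction on its own. Setting a modality's confidence to zero drops it entirely, which is one simple way to get graceful fallback.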
Alignment eats your time
Keeping two streams in step, when they are sampled at different rates and arrive at different times, is the real engineering tax of multimodal work. Getting vision and audio windows to line up in real time was harder than building either model. Budget for it.
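One common way to handle this is nearest-timestamp matching with a tolerance. The sketch below pairs each video frame with the closest audio window, or with nothing if the gap is too large. It assumes both streams carry timestamps on a shared clock and arrive sorted; the function name, the sample timestamps, and the tolerance are illustrative, not taken from any particular system.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def align_nearest(
    video_ts: List[float],
    audio_ts: List[float],
    tolerance: float,
) -> List[Tuple[int, Optional[int]]]:
    """Pair each video timestamp with the nearest audio timestamp.

    Both lists must be sorted ascending and use the same clock (seconds).
    Returns (video_index, audio_index) pairs; audio_index is None when no
    audio timestamp lies within `tolerance` of the video timestamp.
    """
    pairs: List[Tuple[int, Optional[int]]] = []
    for i, t in enumerate(video_ts):
        j = bisect_left(audio_ts, t)
        # The nearest neighbour is either just before or at the insertion point.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(audio_ts)]
        best = min(candidates, key=lambda k: abs(audio_ts[k] - t), default=None)
        if best is not None and abs(audio_ts[best] - t) > tolerance:
            best = None  # too far apart to treat as the same moment
        pairs.append((i, best))
    return pairs


# Hypothetical timestamps: video frames vs. audio windows every 50 ms.
video_ts = [0.0, 0.04, 0.08, 0.5]
audio_ts = [0.0, 0.05, 0.10]
pairs = align_nearest(video_ts, audio_ts, tolerance=0.03)
```

The `None` entries matter as much as the matches: they mark moments where only one modality is available, so downstream fusion has to decide whether to wait, skip, or fall back to a single stream.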
Confidence beats accuracy
A model that knows when it doesn't know is worth more than a slightly more accurate one that's always sure. Calibrated confidence is what lets you fuse intelligently and fall back gracefully. I now treat 'how do I get a trustworthy confidence out of this?' as a first-class question.
Accuracy tells you how often you're right. Confidence tells you when to trust yourself. The second one ships.
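To check whether a confidence is trustworthy, one widely used measure is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its actual accuracy. The sketch below is a plain equal-width-bin version. It is offered as one reasonable check, not as the method used in these experiments, and the bin count and sample data are arbitrary.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float],
    correct: List[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: predicted confidence in [0, 1] for each sample.
    correct: whether the corresponding prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical overconfident model: always 95% sure, right half the time.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# Hypothetical calibrated model: 50% sure, right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A lower value means stated confidence tracks real accuracy more closely, which is the property that makes confidence-based fusion and fallback safe to rely on.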
Read the failure cases
The single highest-leverage habit: stare at the cases where the system was wrong, especially where it was confidently wrong. Almost every real improvement I've made came from one afternoon of watching the model fail and asking why, not from another epoch of training.
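A small helper can make this review easier by pulling out the confidently wrong cases first. The sketch below assumes you log one record per prediction; the `Record` fields, the sample IDs, and the 0.8 threshold are hypothetical choices for illustration.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    sample_id: str
    predicted: str
    actual: str
    confidence: float


def confidently_wrong(records: List[Record], min_confidence: float = 0.8) -> List[Record]:
    """Return misclassified records at or above min_confidence, most confident first."""
    wrong = [
        r for r in records
        if r.predicted != r.actual and r.confidence >= min_confidence
    ]
    return sorted(wrong, key=lambda r: r.confidence, reverse=True)


# Hypothetical evaluation log.
log = [
    Record("clip_01", "happy", "happy", 0.97),
    Record("clip_02", "happy", "sad", 0.85),
    Record("clip_03", "sad", "happy", 0.55),
    Record("clip_04", "angry", "sad", 0.93),
]
review_queue = confidently_wrong(log)
```

The output is only a reading list. The value comes from actually opening each sample and asking why the model was so sure.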