EmotionFlow

Multimodal ML · CV + Audio

ACTIVE · 2026

live demo — webcam + waveform

modalities fused · real-time inference · playlist output

overview.md

EmotionFlow fuses two signals most mood apps ignore: the expression on your face and the tone of your voice. A computer-vision branch reads facial affect frame by frame while an audio branch listens for prosodic cues, and the two are combined into a single, stable emotion estimate that drives playlist curation through the Spotify API.

problem.txt

Single-modality emotion models are brittle. A face can be neutral while a voice is clearly stressed; lighting or a turned head wipes out a vision-only read. Most consumer mood tools also stop at a label — they never do anything with it. I wanted something robust enough to trust and useful enough to actually run.

architecture.drawio

Capture — a live webcam + microphone stream is sampled into short synchronized windows.
Vision branch — face detection + a CNN classifier estimates per-frame facial affect.
Audio branch — prosodic features feed a model that reads vocal emotional tone.
Fusion — the two streams are combined and temporally smoothed into one calibrated emotion signal.
Action — the mood maps to a Spotify playlist request through a FastAPI service.
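The Capture step depends on pairing each video frame with the audio window closest to it in time. A minimal sketch of one way to do that, assuming both streams carry timestamps in seconds; `pair_windows` and its `tolerance` parameter are hypothetical names for illustration, not taken from the project:

```python
from bisect import bisect_left


def pair_windows(video_ts, audio_ts, tolerance=0.05):
    """Pair each video timestamp with the nearest audio window start.

    Both lists hold seconds and must be sorted ascending. Frames with no
    audio window within `tolerance` seconds are dropped, not mismatched.
    """
    pairs = []
    for t in video_ts:
        i = bisect_left(audio_ts, t)
        # Candidates: the audio window just before t and the one at/after t.
        candidates = audio_ts[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda a: abs(a - t))
        if abs(nearest - t) <= tolerance:
            pairs.append((t, nearest))
    return pairs
```

Dropping unmatched frames keeps the fusion stage from ever combining a face and a voice sample that were not captured together, at the cost of a slightly lower effective frame rate.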

challenges.log

Keeping vision and audio windows aligned in real time without the UI stalling.
Smoothing per-frame predictions so the output doesn't flicker between states.
Designing a fusion rule that trusts the more confident modality instead of naively averaging the two.
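The last two challenges can be illustrated with a small sketch. It assumes each branch emits a probability dict over the same labels, weights each modality by its own peak probability, and smooths the fused result with an exponential moving average. The weighting rule, the `fuse` function, and the `EmaSmoother` class are illustrative assumptions; the project's actual fusion rule is not described in detail here.

```python
def fuse(vision, audio):
    """Confidence-weighted fusion of two probability dicts with the same keys.

    Each modality is weighted by its own peak probability, so the more
    confident stream dominates instead of a plain 50/50 average.
    Assumes each dict has at least one non-zero probability.
    """
    w_v, w_a = max(vision.values()), max(audio.values())
    total = w_v + w_a
    return {k: (w_v * vision[k] + w_a * audio[k]) / total for k in vision}


class EmaSmoother:
    """Exponential moving average over successive fused distributions."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha  # higher alpha reacts faster but flickers more
        self.state = None

    def update(self, probs):
        """Blend in the new distribution and return the current top label."""
        if self.state is None:
            self.state = dict(probs)
        else:
            a = self.alpha
            self.state = {k: a * probs[k] + (1 - a) * self.state[k] for k in probs}
        return max(self.state, key=self.state.get)
```

With a flat vision read and a confident audio read, `fuse` leans toward the audio label more than a plain mean would, and the smoother needs several consecutive windows of a new state before the displayed label changes.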

results.json

modalities: vision + audio

latency target: real-time stream

downstream: Spotify playlist curation

lessons-learned.md

Fusion beats a bigger single-modality model — disagreement between streams is itself signal.
Temporal smoothing matters more than raw per-frame accuracy for anything a human watches.
Shipping the result into a product (a playlist) exposed assumptions a notebook never would.

future-work.md

Add a text/transcript modality for spoken sentiment.
Run inference on-device to remove the network round trip.
Personalized mood → music mapping that learns from skips.
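As a rough idea of what skip-driven personalization could look like, the sketch below keeps a score per mood and playlist, lowers it on a skip, and raises it on a full play. The class name, the fixed `step` size, and the playlist identifiers are all hypothetical; nothing like this exists in the project yet.

```python
from collections import defaultdict


class MoodPlaylistPreferences:
    """Per-mood playlist scores: nudged down on skips, up on full plays."""

    def __init__(self, step=0.2):
        self.step = step
        # scores[mood][playlist_id] -> running preference score
        self.scores = defaultdict(lambda: defaultdict(float))

    def record(self, mood, playlist_id, skipped):
        """Update the score for one listening event."""
        delta = -self.step if skipped else self.step
        self.scores[mood][playlist_id] += delta

    def best(self, mood, candidates):
        """Pick the highest-scoring candidate; unseen playlists score 0.0."""
        scores = self.scores[mood]
        return max(candidates, key=lambda p: scores.get(p, 0.0))
```

Because unseen playlists default to a score of zero, a playlist that has been skipped more often than played falls behind fresh candidates, which gives a simple form of exploration.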
