emotionflow.md

CASE 01 / 05

EmotionFlow

Multimodal ML · CV + Audio

ACTIVE · 2026
[Live demo: webcam + waveform]
2 modalities fused · real-time inference · playlist output

overview.md

EmotionFlow fuses two signals most mood apps ignore: the expression on your face and the tone of your voice. A computer-vision branch reads facial affect frame by frame while an audio branch listens for prosodic cues, and the two are combined into a single, stable emotion estimate that drives playlist curation through the Spotify API.

problem.txt

Single-modality emotion models are brittle. A face can be neutral while a voice is clearly stressed; lighting or a turned head wipes out a vision-only read. Most consumer mood tools also stop at a label — they never do anything with it. I wanted something robust enough to trust and useful enough to actually run.

architecture.drawio

  1. Capture — a live webcam + microphone stream is sampled into short synchronized windows.
  2. Vision branch — face detection + a CNN classifier estimates per-frame facial affect.
  3. Audio branch — prosodic features feed a model that reads vocal emotional tone.
  4. Fusion — the two streams are combined and temporally smoothed into one calibrated emotion signal.
  5. Action — the mood maps to a Spotify playlist request through a FastAPI service.
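
The capture step (step 1) can be illustrated with a minimal sketch that groups timestamped samples into shared windows. The names `Window` and `build_windows`, the `(timestamp, payload)` tuple format, and the 1-second window length are assumptions made for illustration, not details taken from the project.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    """One fixed-length slice of the stream holding both modalities."""
    start: float                                  # window start time, in seconds
    frames: list = field(default_factory=list)    # video frames in this window
    audio: list = field(default_factory=list)     # audio chunks in this window


def build_windows(frames, audio_chunks, window_s=1.0):
    """Group (timestamp_seconds, payload) samples into shared windows.

    Only windows containing at least one frame AND one audio chunk are
    returned, so both branches always see data from the same time span.
    """
    windows = {}
    for t, frame in frames:
        idx = int(t // window_s)
        windows.setdefault(idx, Window(start=idx * window_s)).frames.append(frame)
    for t, chunk in audio_chunks:
        idx = int(t // window_s)
        windows.setdefault(idx, Window(start=idx * window_s)).audio.append(chunk)
    # Sort by window index and drop windows missing either modality.
    return [w for _, w in sorted(windows.items()) if w.frames and w.audio]
```

Dropping windows that lack one modality keeps the sketch simple. A real pipeline might instead pass the surviving modality through and let the fusion step handle the gap.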

challenges.log

  • Keeping vision and audio windows aligned in real time without the UI stalling.
  • Smoothing per-frame predictions so the output doesn't flicker between states.
  • Designing a fusion rule that trusts the more confident modality instead of naive averaging.
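
One way the last two challenges could be approached is sketched below: each modality is weighted by its own confidence, then the fused vector is smoothed with an exponential moving average. The label set, the use of the top class probability as "confidence", and `alpha=0.3` are all assumptions for illustration. The project's actual fusion rule and smoothing method are not specified here.

```python
EMOTIONS = ["happy", "sad", "angry", "neutral"]  # hypothetical label set


def fuse(vision_probs, audio_probs):
    """Weight each modality by its confidence (its highest class probability)."""
    wv, wa = max(vision_probs), max(audio_probs)
    total = wv + wa
    return [(wv * v + wa * a) / total for v, a in zip(vision_probs, audio_probs)]


class EmaSmoother:
    """Exponential moving average over successive probability vectors."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # higher alpha reacts faster but smooths less
        self.state = None

    def update(self, probs):
        if self.state is None:
            self.state = list(probs)
        else:
            self.state = [self.alpha * p + (1 - self.alpha) * s
                          for p, s in zip(probs, self.state)]
        return self.state


def label(probs):
    """Return the emotion name with the highest probability."""
    return EMOTIONS[max(range(len(probs)), key=probs.__getitem__)]
```

With a confident vision read and an uncertain audio read, the fused estimate leans toward vision more than a plain average would. A single outlier window does not flip the smoothed label.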

results.json

modalities: vision + audio
latency target: real-time stream
downstream: Spotify playlist curation

lessons-learned.md

  • Fusion beats a bigger single-modality model — disagreement between streams is itself signal.
  • Temporal smoothing matters more than raw per-frame accuracy for anything a human watches.
  • Shipping the result into a product (a playlist) exposed assumptions a notebook never would.
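
To make the first lesson concrete, disagreement between the two streams can be quantified. The sketch below uses Jensen-Shannon divergence between the vision and audio probability vectors. This metric is an assumption chosen for illustration, not necessarily what the project uses.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical inputs, 1 bit for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A high value flags windows where face and voice tell different stories, such as a neutral face with a stressed voice. Those windows could be surfaced or handled separately rather than averaged away.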

future-work.md

  • Add a text/transcript modality for spoken sentiment.
  • On-device inference to drop the round-trip latency.
  • Personalized mood → music mapping that learns from skips.

Aditya Dixit · Jaipur, India · Set in IBM Plex Serif & Mono · © 2026