EmotionFlow started as a small, slightly silly idea: what if your music followed your mood without you choosing anything? Look a little tired, get something slower. Sound excited, get something with more drive. The product is almost a toy. The engineering underneath turned out to be the part I keep thinking about.
The premise
The system watches a webcam and listens to a microphone at the same time. A vision branch estimates facial affect frame by frame. An audio branch reads the emotional tone of your voice. Those two signals get fused into one estimate of how you're feeling, and that estimate becomes a request to the Spotify API for a playlist that fits.
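To make that last step concrete, here is a minimal sketch of turning a mood label into a playlist search request. It is an illustration, not the actual EmotionFlow code: the mood labels, the `MOOD_QUERIES` table, and the `playlist_search_url` helper are all hypothetical, and it assumes the Spotify Web API search endpoint with `type=playlist`. A real call would also need an OAuth bearer token in the `Authorization` header.

```python
from urllib.parse import urlencode

# Hypothetical hand-tuned mapping from a fused mood label to a search phrase.
MOOD_QUERIES = {
    "tired": "slow acoustic",
    "excited": "high energy",
    "stressed": "calm ambient",
}

SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"


def playlist_search_url(mood: str, limit: int = 5) -> str:
    """Build a playlist search URL for a mood; unknown moods fall back to a default."""
    query = MOOD_QUERIES.get(mood, "chill mix")
    params = urlencode({"q": query, "type": "playlist", "limit": limit})
    return f"{SEARCH_ENDPOINT}?{params}"


url = playlist_search_url("tired")
```

Keeping the mapping in a plain table like this makes it obvious how much of the behaviour is taste rather than modelling.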
Why one modality isn't enough
My first version was vision only, because faces feel like the obvious place to read emotion. It was also embarrassingly fragile. Turn your head, sit in bad lighting, or just have a naturally flat expression, and the model fell apart. Worse, it was confidently wrong: a neutral face would read as a hard 'neutral' even when the person was audibly stressed.
Voice had the opposite failure mode. Prosody carries a lot of emotional information, but the signal is noisy, and half the time there's no speech to read at all. Neither stream alone was something I'd trust.
The moment two models disagree is not a problem to smooth over. It's the most informative thing that happens.
Making two models argue
The naive fusion is to average the two predictions. Don't do that. Averaging a confident read from one modality with an uncertain read from the other just drags both toward mush. What worked better was letting the more confident modality lead, and treating disagreement between them as a signal in itself. If your face says calm but your voice says stressed, that tension is real information, not noise to be cancelled out.
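A minimal sketch of confidence-led fusion, assuming each branch outputs a probability per mood label. The label names, the `Fused` record, and the `fuse` function are made up for illustration, and "confidence" here is simply the top probability, which is one choice among several.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Fused:
    label: str          # mood picked by the leading modality
    confidence: float   # that modality's top probability
    leader: str         # "vision" or "audio"
    disagreement: bool  # True when the two branches pick different moods


def fuse(vision: Mapping[str, float], audio: Mapping[str, float]) -> Fused:
    """Let the more confident modality lead, and surface disagreement as its own flag."""
    v_label = max(vision, key=vision.get)
    a_label = max(audio, key=audio.get)
    v_conf, a_conf = vision[v_label], audio[a_label]

    if v_conf >= a_conf:
        label, conf, leader = v_label, v_conf, "vision"
    else:
        label, conf, leader = a_label, a_conf, "audio"

    # The conflict is kept as a flag instead of being averaged away.
    return Fused(label, conf, leader, disagreement=(v_label != a_label))


# Face reads mildly calm, voice reads clearly stressed.
result = fuse(
    vision={"calm": 0.55, "stressed": 0.45},
    audio={"calm": 0.10, "stressed": 0.90},
)
```

In this example the audio branch leads with "stressed", and `disagreement` is set so that downstream code can react to the conflict, for instance by holding the current playlist a little longer before switching.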
Smoothing is the whole game
Per-frame accuracy turned out to matter far less than temporal stability. A model that's 90% accurate but flickers between states every frame feels broken to a human. A model that's 80% accurate but moves slowly and deliberately feels intelligent. I spent more time on the smoothing window than on the classifiers.
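One way to get that slow, deliberate behaviour is a sliding-window average with a switching margin. The sketch below is an assumption about how such a smoother could look, not the one EmotionFlow shipped: `MoodSmoother`, the window length, and the margin value are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Mapping, Optional


class MoodSmoother:
    """Average per-frame probabilities over a window; switch only on a clear lead."""

    def __init__(self, window: int = 30, margin: float = 0.15) -> None:
        self.frames: Deque[Mapping[str, float]] = deque(maxlen=window)
        self.margin = margin
        self.current: Optional[str] = None

    def update(self, probs: Mapping[str, float]) -> str:
        self.frames.append(probs)
        labels = {label for frame in self.frames for label in frame}
        n = len(self.frames)
        mean: Dict[str, float] = {
            label: sum(frame.get(label, 0.0) for frame in self.frames) / n
            for label in labels
        }
        best = max(mean, key=mean.get)

        if self.current is None:
            self.current = best
        elif best != self.current:
            # Require the challenger to beat the current mood by a margin,
            # so near-ties between frames do not flip the output.
            if mean[best] - mean.get(self.current, 0.0) >= self.margin:
                self.current = best
        return self.current


smoother = MoodSmoother(window=5, margin=0.2)

# Flickering input: the raw top label alternates every frame.
flicker = [
    {"calm": 0.6, "stressed": 0.4},
    {"calm": 0.4, "stressed": 0.6},
] * 3
flicker_out = [smoother.update(f) for f in flicker]

# Sustained input: the mood really has changed.
sustained_out = [smoother.update({"calm": 0.1, "stressed": 0.9}) for _ in range(5)]
```

With the flickering input the smoothed output holds at the first mood, and it only moves once the sustained frames dominate the window. The window and the margin trade responsiveness for stability, which is the knob worth tuning.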
What I'd do differently
I'd add a text modality from speech-to-text, because the actual words carry sentiment the tone sometimes hides. I'd move inference on-device to kill the round trip. And I'd let the mood-to-music mapping learn from what you skip, because my hand-tuned mapping is really just my taste in a trench coat.
But the thing I'm taking into the next project is smaller and more durable: build the seam where models meet, not just the models.