OceanIQ unified fragmented Indian oceanographic datasets — CMLRE records, Angria Bank surveys, marine-mammal sightings — into one clean, analysis-ready schema. I went in expecting to spend my time on modelling. I spent almost all of it on plumbing, and I'm glad I did.
Three datasets, three universes
Each source had its own format, units, naming conventions, and granularity. Not slightly different — genuinely incompatible. One was tabular and tidy; another was survey data at a totally different resolution; the third was sparse, irregular sightings. Calling them 'the data' implied a unity that did not exist.
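To make "incompatible" concrete, here is a minimal sketch of the kind of per-source mapping this implies. The column names, the feet-versus-metres mismatch, and the `normalise` helper are all hypothetical, chosen for illustration rather than taken from the actual OceanIQ sources.

```python
# Hypothetical column names and units; the real source layouts are not shown in this post.
FEET_TO_METRES = 0.3048

# Per-source mapping: raw column name -> (unified name, conversion function)
COLUMN_MAPS = {
    "tabular": {
        "lat": ("latitude_deg", float),
        "lon": ("longitude_deg", float),
        "depth_m": ("depth_m", float),
    },
    "survey": {
        "Latitude": ("latitude_deg", float),
        "Longitude": ("longitude_deg", float),
        "Depth (ft)": ("depth_m", lambda v: float(v) * FEET_TO_METRES),
    },
}


def normalise(source: str, raw: dict) -> dict:
    """Rename columns and convert units for one raw record from a known source."""
    out = {"source": source}
    for raw_name, (unified_name, convert) in COLUMN_MAPS[source].items():
        # Skip missing or empty values instead of guessing a default.
        if raw_name in raw and raw[raw_name] not in ("", None):
            out[unified_name] = convert(raw[raw_name])
    return out


a = normalise("tabular", {"lat": "15.5", "lon": "73.2", "depth_m": "30"})
b = normalise("survey", {"Latitude": "15.5", "Longitude": "73.2", "Depth (ft)": "100"})
print(a)
print(b)
```

Keeping the mapping in a plain table rather than scattered through the code means every naming and unit decision sits in one place where it can be reviewed.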
There was no shared key
The thing that breaks every merge tutorial: there was no common identifier to join on. Matches had to be inferred from the fields that did line up, then checked by hand on enough records that I could trust the rule. This is the unglamorous reality of real-world data work: the join is a modelling decision, not a one-liner.
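One way such an inferred join can look, as a sketch only: match records on proximity in space and time instead of on a key. The field names (`lat`, `lon`, `date`, `id`), the thresholds, and the coordinates below are assumptions made up for illustration, not the rule OceanIQ actually used.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def infer_match(record, candidates, max_km=5.0, max_days=3):
    """Return the nearest candidate within both thresholds, or None.

    The thresholds are the modelling decision: loosen them and you get
    more matches and more false ones. None means "send to manual review".
    """
    best, best_km = None, None
    for cand in candidates:
        days = abs((record["date"] - cand["date"]).days)
        km = haversine_km(record["lat"], record["lon"], cand["lat"], cand["lon"])
        if days <= max_days and km <= max_km and (best_km is None or km < best_km):
            best, best_km = cand, km
    return best


# Made-up records for illustration.
surveys = [
    {"id": "S1", "lat": 16.51, "lon": 72.01, "date": date(2020, 1, 11)},
    {"id": "S2", "lat": 18.00, "lon": 72.00, "date": date(2020, 1, 10)},
]
sighting = {"lat": 16.50, "lon": 72.00, "date": date(2020, 1, 10)}
stray = {"lat": 10.00, "lon": 75.00, "date": date(2020, 1, 10)}

matched = infer_match(sighting, surveys)
unmatched = infer_match(stray, surveys)
print(matched["id"] if matched else None, unmatched)
```

Returning `None` rather than the least-bad candidate is deliberate: an unmatched record is visible and can be reviewed, while a forced match is silently wrong.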
When the ground truth itself is messy, 'clean' stops being obvious and becomes something you have to define and defend.
Schema is the product
The single most important artifact wasn't a model or a notebook — it was the target schema. Once I'd decided what a unified record should look like, every other step had a clear job: map this source into that shape. Designing the schema well made the rest of the pipeline almost mechanical. Designing it badly would have poisoned everything downstream.
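As an illustration of "map this source into that shape", a target schema can be as small as one validated record type plus one mapper per source. The fields, the `UnifiedRecord` name, and the raw sighting keys below are hypothetical; the actual OceanIQ schema is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class UnifiedRecord:
    """Hypothetical unified record; every source must be mapped into this shape."""
    source: str
    observed_on: date
    latitude_deg: float
    longitude_deg: float
    depth_m: Optional[float] = None  # not every source records depth
    taxon: Optional[str] = None      # only sighting-style sources carry a taxon

    def __post_init__(self):
        # Reject impossible coordinates at the boundary, not downstream.
        if not -90.0 <= self.latitude_deg <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude_deg}")
        if not -180.0 <= self.longitude_deg <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude_deg}")


def from_sighting(raw: dict) -> UnifiedRecord:
    """Map one raw sighting row (assumed keys) into the unified shape."""
    return UnifiedRecord(
        source="sightings",
        observed_on=date.fromisoformat(raw["seen"]),
        latitude_deg=float(raw["lat"]),
        longitude_deg=float(raw["lon"]),
        taxon=raw.get("species") or None,
    )


rec = from_sighting({"seen": "2020-01-10", "lat": "16.5", "lon": "72.0", "species": "Tursiops truncatus"})
print(rec)
```

With the shape fixed and validated in one place, each new source only needs its own small `from_...` function, which is what makes the rest of the pipeline close to mechanical.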
Where AI actually helped
AI-assisted cleaning earned its place in the matching and normalisation steps — the fuzzy, judgment-heavy work of deciding that two differently-spelled things are the same thing. But it needed guardrails and spot-checks. Automated cleaning that you don't audit is just a faster way to be wrong at scale.
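The guardrail pattern is independent of which matcher you use. In the sketch below, plain string similarity from Python's standard library (`difflib.SequenceMatcher`) stands in for the AI-assisted matcher, and the canonical name list, the thresholds, and the helper names are assumptions for illustration. The point is the three-way split (accept, review, reject) and the random audit of what was auto-accepted.

```python
import random
from difflib import SequenceMatcher

# Hypothetical canonical vocabulary to match messy names against.
CANONICAL = ["Tursiops truncatus", "Sousa plumbea", "Balaenoptera edeni"]


def match_name(raw, canonical=CANONICAL, accept=0.90, review=0.70):
    """Classify a raw name as accept / review / reject against a canonical list."""
    cleaned = " ".join(raw.lower().split())  # normalise case and whitespace
    scored = [(SequenceMatcher(None, cleaned, c.lower()).ratio(), c) for c in canonical]
    score, best = max(scored)
    if score >= accept:
        return ("accept", best, score)
    if score >= review:
        return ("review", best, score)  # plausible, but a human decides
    return ("reject", None, score)


def audit_sample(accepted, fraction=0.1, seed=0):
    """Pick a reproducible random subset of auto-accepted matches to check by hand."""
    if not accepted:
        return []
    k = max(1, round(len(accepted) * fraction))
    return random.Random(seed).sample(accepted, k)


results = {name: match_name(name) for name in ["Tursiops trucatus", "T. truncatus", "xyz"]}
for name, (status, best, score) in results.items():
    print(f"{name!r}: {status} -> {best} ({score:.2f})")

accepted = [name for name, (status, _, _) in results.items() if status == "accept"]
print("to audit:", audit_sample(accepted))
```

The review band is where the judgment lives: a one-letter typo clears the accept threshold, an abbreviation lands in review, and anything unrecognisable is rejected rather than forced onto the nearest name.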
Closing the loop with a use case
OceanIQ was built around a business case study on ocean-data accessibility, and that constraint saved the project. A data project without a question to answer expands forever. The case study told me which fields mattered and when the schema was good enough to stop.
I came out of it believing something I'll repeat for a while: in real-world ML, the data engineering is the work. The model rides on top of decisions you made weeks earlier.