OceanIQ

AI Data Platform · Pipeline

MVP · 2025

[Figure: unified schema diagram]

Output: 1 schema (sources unified)

Built for: case study

overview.md

Indian ocean data lives in incompatible silos (CMLRE records, Angria Bank surveys, marine-mammal sightings), each with its own format and quirks. OceanIQ is the pipeline that ingests those fragments, reconciles them, and lands a single clean schema you can query. It was built around a business case study on ocean-data accessibility.

problem.txt

The data exists, but it's unusable: different formats, units, naming, and granularity per source. Any analysis starts with weeks of manual cleaning. The goal was to make 'ask a question of all of it at once' possible.

architecture.drawio

Ingest — readers for each source (CMLRE, Angria Bank, marine-mammal surveys).
Normalise — units, naming, and types are standardised per field.
Reconcile — records are aligned and merged with AI-assisted matching.
Land — everything resolves into one analysis-ready schema.
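
The four stages above can be sketched as a small Python skeleton. This is a minimal illustration under stated assumptions, not the project's actual code: the `Observation` fields, the `FIELD_ALIASES` mapping, the feet-to-metres case, and the reader names are all hypothetical, and the `reconcile` step here only drops exact duplicates rather than doing AI-assisted matching.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


# Hypothetical unified record; the real OceanIQ schema is not shown in this write-up.
@dataclass(frozen=True)
class Observation:
    source: str
    lat: float
    lon: float
    depth_m: float


# Hypothetical per-source field aliases mapped to unified names.
FIELD_ALIASES = {"latitude": "lat", "Lat": "lat", "longitude": "lon", "Lon": "lon"}
FEET_TO_METRES = 0.3048


def normalise(source: str, raw: dict) -> Observation:
    """Normalise: rename fields, coerce types, and convert depth to metres."""
    row = {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}
    if "depth_ft" in row:
        depth_m = float(row["depth_ft"]) * FEET_TO_METRES
    else:
        depth_m = float(row["depth_m"])
    return Observation(source, float(row["lat"]), float(row["lon"]), depth_m)


def reconcile(records: Iterable[Observation]) -> list[Observation]:
    """Reconcile (placeholder): drop exact duplicates, keeping first-seen order."""
    return list(dict.fromkeys(records))


def run_pipeline(readers: dict[str, Callable[[], Iterable[dict]]]) -> list[Observation]:
    """Ingest from every reader, normalise each row, reconcile, and return (land) one list."""
    normalised = [
        normalise(name, raw) for name, read in readers.items() for raw in read()
    ]
    return reconcile(normalised)


# Made-up rows standing in for two source readers.
readers = {
    "cmlre": lambda: [{"Lat": "15.5", "Lon": "72.0", "depth_ft": "100"}],
    "angria_bank": lambda: [{"latitude": 16.4, "longitude": 72.1, "depth_m": 24.0}],
}
table = run_pipeline(readers)
```

Keeping each stage a plain function makes it easy to swap the placeholder `reconcile` for a smarter matcher without touching ingestion or normalisation.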

dataset.csv

Heterogeneous public Indian oceanographic sources — CMLRE datasets, Angria Bank surveys, and marine-mammal sighting records — each arriving in its own format and resolution.

source	type	challenge
CMLRE	tabular records	naming + units
Angria Bank	survey data	granularity
Marine mammals	sightings	sparse / irregular

challenges.log

No shared key across sources — reconciliation had to be inferred.
Silent format drift inside a single 'source' over time.
Deciding what 'clean' means when ground truth itself is messy.
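
One way to infer links when sources share no key is to match records that are close in both space and time. The sketch below uses a plain deterministic heuristic (nearest neighbour within a distance and time window), which is a stand-in for, not a description of, the AI-assisted matching used in the project. The record fields (`lat`, `lon`, `time`) and the thresholds are assumptions chosen for illustration.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_records(left, right, max_km=5.0, max_gap=timedelta(hours=6)):
    """For each left record, return the index of the nearest right record
    that falls inside both thresholds, or None if there is no candidate."""
    matches = []
    for a in left:
        best_index, best_km = None, max_km
        for j, b in enumerate(right):
            if abs(a["time"] - b["time"]) > max_gap:
                continue  # too far apart in time to be the same event
            distance = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
            if distance <= best_km:
                best_index, best_km = j, distance
        matches.append(best_index)
    return matches


# Made-up example records, not real survey data.
surveys = [
    {"lat": 16.40, "lon": 72.10, "time": datetime(2024, 1, 1, 10)},
    {"lat": 10.00, "lon": 75.00, "time": datetime(2024, 1, 1, 10)},
]
sightings = [
    {"lat": 16.41, "lon": 72.10, "time": datetime(2024, 1, 1, 12)},
    {"lat": 18.00, "lon": 72.10, "time": datetime(2024, 1, 1, 10)},
]
links = match_records(surveys, sightings)
```

Returning `None` for unmatched records, rather than forcing a nearest match, keeps uncertain links visible for later review.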

lessons-learned.md

Schema design is the real product — the model is downstream of it.
AI-assisted cleaning saves time but needs guardrails and spot-checks.
A use case (the case study) keeps a data project from boiling the ocean.
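
The guardrails-and-spot-checks lesson could look something like the triage step below. It is a hedged sketch: the `confidence` field, the 0.9 acceptance threshold, and the 20% spot-check rate are hypothetical values, not figures from the project.

```python
import math
import random


def triage(matches, accept_at=0.9, spot_check_rate=0.2, seed=0):
    """Split AI-proposed matches into auto-accepted and needs-review,
    then draw a random sample of the accepted ones for a human spot-check."""
    accepted = [m for m in matches if m["confidence"] >= accept_at]
    review = [m for m in matches if m["confidence"] < accept_at]
    sample_size = min(len(accepted), math.ceil(len(accepted) * spot_check_rate))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    spot_check = rng.sample(accepted, sample_size)
    return accepted, review, spot_check


# Made-up match proposals for illustration.
proposals = [{"id": i, "confidence": 0.95} for i in range(10)]
proposals += [{"id": 10, "confidence": 0.50}, {"id": 11, "confidence": 0.70}]
accepted, review, spot_check = triage(proposals)
```

Sampling from the accepted set matters because a confident model can still be wrong; the threshold alone only catches the cases where it knows it is unsure.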

future-work.md

Incremental ingestion as new survey data lands.
A query UI on top of the unified schema.
Data-quality scoring per record.
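
As a rough idea of what per-record quality scoring might involve, the sketch below scores a record by the fraction of simple checks it passes. The required fields and the checks themselves are assumptions made for illustration; the write-up does not specify how scoring would work.

```python
# Hypothetical required fields in the unified schema.
REQUIRED_FIELDS = ("lat", "lon", "time", "source")


def quality_score(record: dict) -> float:
    """Return the fraction of checks passed, from 0.0 (none) to 1.0 (all)."""
    # Completeness: one check per required field.
    checks = [record.get(field) is not None for field in REQUIRED_FIELDS]
    # Plausibility: coordinates must be numeric and within valid ranges.
    lat, lon = record.get("lat"), record.get("lon")
    checks.append(isinstance(lat, (int, float)) and -90 <= lat <= 90)
    checks.append(isinstance(lon, (int, float)) and -180 <= lon <= 180)
    return sum(checks) / len(checks)


# Made-up records for illustration.
complete = {"lat": 16.4, "lon": 72.1, "time": "2024-01-01T10:00", "source": "cmlre"}
flawed = {"lat": 200.0, "lon": 72.1, "time": None, "source": "cmlre"}
scores = [quality_score(complete), quality_score(flawed)]
```

A score like this is only a triage signal: it flags records worth a second look, not records that are necessarily wrong.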
