FloatChat

Makes a heavyweight scientific archive answerable in plain language, from a cached store that fits on one machine.

operational overview

FloatChat exists because oceanographic archives are rich and unreadable: NetCDF files, instrument jargon, no interface for a question like 'average salinity near Odisha in March 2023'. The application turns the Argo float archive into an explorable interface with a chat layer that understands the data's own vocabulary.

architecture

An offline converter prepares the archive — NetCDF into Parquet — so the live application reads a compact cached store. Streamlit drives the interface: Plotly maps, a static Leaflet view, a Cesium 3D globe, depth profiles, Hovmöller diagrams, float-to-float comparisons, raw tables, and export utilities, blended with NOAA buoy metadata and INCOIS mooring overlays. The chat layer parses places, coordinates, time windows, statistics, and nearest-float lookups against the cache, answering with inline charts and aggregated summaries.
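
The nearest-float lookup mentioned above can be sketched as a great-circle search over cached float positions. This is a minimal sketch under stated assumptions, not FloatChat's actual code: the FloatPosition record, the nearest_float helper, and the sample coordinates are all hypothetical, and a real cache would be read from Parquet rather than held in a list.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class FloatPosition:
    """Hypothetical cached record: one float's last known position."""
    float_id: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_float(positions, lat, lon):
    """Return (record, distance_km) for the closest float, or None if empty."""
    if not positions:
        return None
    best = min(positions, key=lambda p: haversine_km(lat, lon, p.lat, p.lon))
    return best, haversine_km(lat, lon, best.lat, best.lon)


# Made-up sample positions, for illustration only.
CACHE = [
    FloatPosition("demo-A", 19.5, 87.0),
    FloatPosition("demo-B", 10.5, 72.5),
    FloatPosition("demo-C", -5.0, 80.0),
]

record, distance_km = nearest_float(CACHE, 20.0, 86.5)
print(record.float_id, round(distance_km, 1))
```

A linear scan is enough for a cache of a few thousand floats; a spatial index would only matter at a much larger scale.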

constraints

archive weight — raw NetCDF is too heavy to query interactively; conversion is a precondition, not an optimization
query vocabulary — questions arrive as geography and oceanography ('PSAL near Lakshadweep'), not as column names
offline-first — the explorer must answer from cache; external services are enrichment, never dependency
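
The query-vocabulary constraint can be illustrated with a small rule-based parser that maps geographic and oceanographic wording onto cache fields. This is a hedged sketch only: the PLACES gazetteer and its rough coordinates, the PARAMETERS and STATISTICS tables, and the ParsedQuery and parse_query names are assumptions made for illustration, not the application's real tables or API.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical gazetteer: place name -> rough (lat, lon) centre.
PLACES = {
    "odisha": (20.0, 86.5),
    "lakshadweep": (10.5, 72.6),
}

# Hypothetical vocabulary: user wording -> parameter name in the cache.
PARAMETERS = {"salinity": "PSAL", "psal": "PSAL",
              "temperature": "TEMP", "temp": "TEMP"}

# Hypothetical vocabulary: user wording -> aggregation to apply.
STATISTICS = {"average": "mean", "mean": "mean",
              "maximum": "max", "max": "max",
              "minimum": "min", "min": "min"}

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


@dataclass
class ParsedQuery:
    parameter: Optional[str] = None
    statistic: Optional[str] = None
    place: Optional[str] = None
    center: Optional[Tuple[float, float]] = None
    year: Optional[int] = None
    month: Optional[int] = None


def parse_query(text: str) -> ParsedQuery:
    """Extract parameter, statistic, place, and time window with fixed rules."""
    query = ParsedQuery()
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        if token in PARAMETERS:
            query.parameter = PARAMETERS[token]
        elif token in STATISTICS:
            query.statistic = STATISTICS[token]
        elif token in PLACES:
            query.place = token
            query.center = PLACES[token]
        elif token in MONTHS:
            query.month = MONTHS[token]
        elif token.isdigit() and len(token) == 4:
            query.year = int(token)
    return query


print(parse_query("average salinity near Odisha in March 2023"))
```

Fields the rules cannot fill stay None, so the caller can tell a partial parse from a complete one instead of guessing.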

tradeoffs

pre-converted Parquet cache over live archive queries: deserializing NetCDF on every request took longer than a conversational reply can tolerate, so that cost moved out of the request path into an offline step
rule-parsed queries with LLM enrichment over LLM-first chat: deterministic answers come from the data, generative text appears only where the cache has no answer
Streamlit over a custom frontend: shipping a broad set of exploration views mattered more than owning the interface layer

failure notes

the fallback chain is explicit and ordered: cached data first; DuckDuckGo snippets when the cache cannot answer; locally generated context only if an Ollama model is actually running — each rung degrades, none pretend
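
The ordered chain above can be sketched as a list of rungs tried in sequence, with the answering rung recorded so the interface can show where a reply came from. This is an illustration under assumptions, not the project's implementation: the Answer type, the answer_with_fallback function, and the three stand-in rungs are hypothetical, and the stand-ins neither call DuckDuckGo nor Ollama.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Answer:
    text: str
    source: str  # name of the rung that produced the text


# A rung takes the question and returns text, or None if it cannot answer.
Rung = Callable[[str], Optional[str]]


def answer_with_fallback(question: str,
                         rungs: Sequence[Tuple[str, Rung]]) -> Answer:
    """Try each rung in order and return the first non-empty answer."""
    for name, rung in rungs:
        try:
            text = rung(question)
        except Exception:
            # A failing rung (for example, no network) degrades to the next.
            text = None
        if text:
            return Answer(text=text, source=name)
    # Honest absence: report that nothing could answer.
    return Answer(text="No answer available.", source="none")


# Stand-in rungs for illustration only.
def from_cache(question: str) -> Optional[str]:
    if "psal" in question.lower():
        return "PSAL summary computed from the cache"
    return None


def from_web_snippets(question: str) -> Optional[str]:
    raise ConnectionError("offline")  # simulate an unreachable service


def from_local_model(question: str) -> Optional[str]:
    return None  # simulate no local model running


RUNGS = [
    ("cache", from_cache),
    ("web-snippets", from_web_snippets),
    ("local-model", from_local_model),
]

print(answer_with_fallback("PSAL near Lakshadweep", RUNGS).source)
```

Carrying the rung name in the result is what keeps the degradation visible: a snippet-based reply is labeled as such rather than presented as a cache answer.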

infrastructure

python · streamlit · plotly · leaflet · cesium · netcdf → parquet · ollama (optional)

engineering reasoning

Scientific data systems earn trust by answering from the data, not about it. Keeping the deterministic path (cache, statistics, charts) primary and the generative path explicitly secondary is the same degradation discipline as any backend: visible fallbacks, honest absence.

future work

broader geographic focus beyond the Indian Ocean defaults
deeper BGC parameter comparisons