DOSSIER · SYS.FLOAT / 03 · operational
> source: github.com/advay-sinha/FloatChat-AIFloatChat
Makes a heavyweight scientific archive answerable in plain language, from a cached store that fits on one machine.
operational overview
FloatChat exists because oceanographic archives are rich and unreadable: NetCDF files, instrument jargon, no interface for a question like 'average salinity near Odisha in March 2023'. The application turns the ARGO float archive into an explorable interface with a chat layer that understands the data's own vocabulary.
architecture
An offline converter prepares the archive by turning NetCDF into Parquet, so the live application reads a compact cached store. Streamlit drives the interface: Plotly maps, a static Leaflet view, a Cesium 3D globe, depth profiles, Hovmöller diagrams, float-to-float comparisons, raw tables, and export utilities, blended with NOAA buoy metadata and INCOIS mooring overlays. The chat layer parses each question for places, coordinates, time windows, and requested statistics, resolves them (including nearest-float lookups) against the cache, and answers with inline charts and aggregated summaries.
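A nearest-float lookup of the kind described above could be sketched as follows. This is a minimal illustration, not the repository's code: `FloatPosition`, `haversine_km`, and `nearest_float` are hypothetical names, and the cache is assumed to expose each float's last known latitude and longitude.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class FloatPosition:
    float_id: str  # hypothetical identifier column in the cache
    lat: float     # degrees north
    lon: float     # degrees east


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_float(positions, lat, lon):
    """Return (position, distance_km) for the closest float, or None if there are none."""
    if not positions:
        return None
    best = min(positions, key=lambda p: haversine_km(lat, lon, p.lat, p.lon))
    return best, haversine_km(lat, lon, best.lat, best.lon)
```

A linear scan is enough for a cache that fits on one machine; a spatial index would only matter at much larger float counts.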
constraints
- archive weight — raw NetCDF is too heavy to query interactively; conversion is a precondition, not an optimization
- query vocabulary — questions arrive as geography and oceanography ('PSAL near Lakshadweep'), not as column names
- offline-first — the explorer must answer from cache; external services are enrichment, never dependency
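The query-vocabulary constraint can be illustrated with a small rule-based parser. This is a sketch under stated assumptions: the alias tables, the gazetteer coordinates (approximate centre points), and the `ParsedQuery` and `parse_query` names are all hypothetical, and the project's real vocabulary is presumably far larger.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical vocabulary: everyday words mapped to Argo-style column names.
PARAMETER_ALIASES = {
    "salinity": "PSAL", "psal": "PSAL",
    "temperature": "TEMP", "temp": "TEMP",
    "pressure": "PRES", "pres": "PRES",
}
# Approximate centre points (lat, lon) in degrees, for illustration only.
GAZETTEER = {"lakshadweep": (10.6, 72.6), "odisha": (20.0, 86.0)}
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
STATISTICS = {"average": "mean", "mean": "mean",
              "minimum": "min", "min": "min",
              "maximum": "max", "max": "max"}


@dataclass
class ParsedQuery:
    parameter: Optional[str] = None
    place: Optional[str] = None
    center: Optional[Tuple[float, float]] = None
    year: Optional[int] = None
    month: Optional[int] = None
    statistic: Optional[str] = None


def parse_query(text: str) -> ParsedQuery:
    """Fill in whichever slots the question mentions; leave the rest as None."""
    query = ParsedQuery()
    for word in re.findall(r"[a-z]+", text.lower()):
        if query.parameter is None and word in PARAMETER_ALIASES:
            query.parameter = PARAMETER_ALIASES[word]
        elif query.place is None and word in GAZETTEER:
            query.place, query.center = word, GAZETTEER[word]
        elif query.month is None and word in MONTHS:
            query.month = MONTHS[word]
        elif query.statistic is None and word in STATISTICS:
            query.statistic = STATISTICS[word]
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    if year:
        query.year = int(year.group())
    return query
```

Calling `parse_query("PSAL near Lakshadweep")` fills only the parameter and place slots. Slots left as `None` tell the caller what the question did not specify, so defaults can be applied explicitly rather than guessed.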
tradeoffs
- pre-converted Parquet cache over live archive queries: deserializing NetCDF on each request took longer than a conversational exchange can tolerate, so that cost moved out of the request path into an offline step
- rule-parsed queries with LLM enrichment over LLM-first chat: deterministic answers from data, generative answers only at the edges
- Streamlit over a custom frontend: exploration breadth shipped ahead of interface ownership
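The offline step in the first tradeoff amounts to paying the reshaping cost once. The sketch below shows only the flattening stage, turning per-profile arrays into one row per depth level, which is the shape a columnar store such as Parquet wants. File I/O is deliberately left out, since reading NetCDF and writing Parquet need third-party libraries. The dictionary keys, the `flatten_profiles` name, and the `99999.0` missing-data sentinel are assumptions to check against the archive's own metadata.

```python
import math

FILL_VALUE = 99999.0  # assumed missing-data sentinel; verify against the archive's _FillValue


def is_missing(value):
    """True for None, NaN, or the assumed fill value."""
    return value is None or math.isnan(value) or value == FILL_VALUE


def flatten_profiles(profiles):
    """Yield one row (dict) per usable depth level.

    Each input profile is a dict with scalar 'float_id', 'time', 'lat', 'lon'
    and equal-length lists 'pres', 'temp', 'psal' (hypothetical layout).
    """
    for profile in profiles:
        levels = zip(profile["pres"], profile["temp"], profile["psal"])
        for pres, temp, psal in levels:
            # A level without pressure, or with no measurement at all, carries no information.
            if is_missing(pres) or (is_missing(temp) and is_missing(psal)):
                continue
            yield {
                "float_id": profile["float_id"],
                "time": profile["time"],
                "lat": profile["lat"],
                "lon": profile["lon"],
                "pres": pres,
                "temp": None if is_missing(temp) else temp,
                "psal": None if is_missing(psal) else psal,
            }
```

Because the rows come out flat and already filtered, the live application never has to touch fill values or nested arrays at question time.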
failure notes
- the fallback chain is explicit and ordered: cached data first; DuckDuckGo snippets when the cache cannot answer; locally generated context only if an Ollama model is actually running — each rung degrades, none pretend
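One way to keep such a chain explicit is to hold the rungs in an ordered list and label every answer with the rung that produced it. The sketch below is illustrative only: `Answer`, `answer_with_fallback`, and the handler callables are hypothetical, and no real search or model API is called here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Handler = Callable[[str], Optional[str]]  # returns None when the rung cannot answer


@dataclass(frozen=True)
class Answer:
    text: str
    source: str  # name of the rung that produced the text, or "none"


def answer_with_fallback(question: str, rungs: Sequence[Tuple[str, Handler]]) -> Answer:
    """Try each rung in order and return the first non-empty answer."""
    for name, handler in rungs:
        try:
            text = handler(question)
        except Exception:
            text = None  # an unavailable rung is skipped, never fatal
        if text:
            return Answer(text, name)
    # Honest absence: say so rather than invent a reply.
    return Answer("No answer available from any source.", "none")
```

A caller would pass something like `[("cache", cache_lookup), ("web", web_snippets), ("local-llm", local_model)]`, where each function is a stand-in for the corresponding rung. Carrying `source` through to the interface is what lets each degraded answer be shown as degraded.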
infrastructure
python · streamlit · plotly · leaflet · cesium · netcdf → parquet · ollama (optional)
engineering reasoning
Scientific data systems earn trust by answering from the data, not about it. Keeping the deterministic path (cache, statistics, charts) primary and the generative path explicitly secondary is the same degradation discipline as any backend: visible fallbacks, honest absence.
future work
- broader geographic focus beyond the Indian Ocean defaults
- deeper BGC parameter comparisons