I take your point, though, that it’s not big data in the conventional sense (i.e. requiring distributed computing to process). The phrasing in the original article was better: “To make iterative analysis practical, we wrote a Julia pipeline: NetCDF source files are converted to Apache Arrow, then thread-parallel bit extraction is performed into a DuckDB database.”
The dataset was 136GB (about 7GB per annum), and the Python implementation took 45 hours for each run. The Julia code that processed the whole dataset and built the database took 5 hours, which made iterative development much more pleasant. With metadata and indices, the resulting database was about 3GB. It's bigger than your estimate since there are multiple observations of the same satellite. Of course, later stages in the pipeline had much less data to process and so were much faster.
The code is all available and every claim is traceable back to the statistical analysis. Results are reproducible from the original data, which is archived on Zenodo. Further analysis would be very welcome. https://github.com/sjmurdoch/gps-special-messages
I feel this is a bit different. At least O_TRUNC is an option that is shown in documentation right next to the open() function so the programmer has the opportunity to spot it. With the FileSavePicker, there is no such option available and they have to add a line to manually truncate the stream. Also, open() is a low-level call, whereas FileSavePicker is the supposedly easy-to-use high level feature. I would say it is closer to fopen(), which does truncate by default.
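To make the difference concrete, here is a minimal Python sketch of what happens when a shorter payload overwrites a longer file. The helper names (`write_low_level`, `write_high_level`) are made up for illustration; `os.open` stands in for the low-level open() call, and the built-in `open(path, "wb")` stands in for the fopen()-style behaviour. It says nothing about how FileSavePicker is implemented, only why a missing truncate leaves stale bytes behind.

```python
import os
import tempfile


def write_low_level(path: str, data: bytes, truncate: bool) -> bytes:
    """Overwrite path via os.open, optionally passing O_TRUNC; return the file's final contents."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC  # the opt-in flag documented next to open()
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


def write_high_level(path: str, data: bytes) -> bytes:
    """Overwrite path via the built-in open() in "wb" mode, which truncates by default."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        return f.read()


def fresh_file(directory: str, name: str) -> str:
    """Create a file containing an 11-byte payload and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(b"hello world")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Without O_TRUNC the tail of the old contents survives.
        print(write_low_level(fresh_file(d, "a"), b"bye", truncate=False))
        # With O_TRUNC, or with the high-level call, only the new data remains.
        print(write_low_level(fresh_file(d, "b"), b"bye", truncate=True))
        print(write_high_level(fresh_file(d, "c"), b"bye"))
```

The first call leaves `b"byelo world"` on disk, which is exactly the kind of corrupted file you get when the truncation step is forgotten; the other two leave just `b"bye"`.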