python - HDF5 possible data corruption or loss?


On Wikipedia one can read the following criticism of HDF5:

Criticism of HDF5 follows from its monolithic design and lengthy specification. Though a 150-page open standard, there is only a single C implementation of HDF5, meaning all bindings share its bugs and performance issues. Compounded by the lack of journaling, documented bugs in the current stable release are capable of corrupting entire HDF5 databases. Although 1.10-alpha adds journaling, it is backwards-incompatible with previous versions. HDF5 also does not support UTF-8 well, necessitating ASCII in most places. Furthermore, even in the latest draft, array data can never be deleted.

I am wondering: does this apply only to the C implementation of HDF5, or is it a general flaw of HDF5?

I am doing scientific experiments which sometimes generate gigabytes of data and, in all cases, at least several hundred megabytes of data. Data loss or corruption would be a huge disadvantage for me.

My scripts have a Python API, hence I am using h5py (version 2.5.0).

So, is this criticism relevant to me, and should I be concerned about corrupted data?

Declaration up front: I help maintain h5py, so I probably have a bias etc.

The Wikipedia page has changed since the question was posted; here's what I see now:

Criticism

Criticism of HDF5 follows from its monolithic design and lengthy specification.

  • Though a 150-page open standard, the only other C implementation of HDF5 is just an HDF5 reader.
  • HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.
  • Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).

I'd say that pretty much sums up the problems with HDF5: it's complex (but people do need that complexity, see the virtual dataset support), it's got a long history with backwards compatibility as its focus, and it's not really designed to allow massive changes in files. It's also not the best on Windows (due to how it deals with filenames).
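To make the "no massive changes in files" point concrete, here is a minimal h5py sketch (the file name is my own example, not anything from the question) showing that deleting a dataset only unlinks it; the file does not shrink until it is rewritten with the external h5repack tool:

    import os

    import h5py
    import numpy as np

    FNAME = "example.h5"  # hypothetical file name for this demo

    # Write a sizeable dataset.
    with h5py.File(FNAME, "w") as f:
        f.create_dataset("data", data=np.zeros((1000, 1000)))
    size_before = os.path.getsize(FNAME)

    # Deleting the dataset only unlinks it from the file...
    with h5py.File(FNAME, "a") as f:
        del f["data"]
    size_after = os.path.getsize(FNAME)

    # ...the file does not shrink; the space is reclaimed only by
    # rewriting the file with the external tool, e.g.:
    #   h5repack example.h5 repacked.h5
    print(size_before, size_after)  # roughly the same size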

I picked HDF5 for my own research because of the options available: it has decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), it supports multidimensional arrays (which formats like Protocol Buffers don't really support), and it supports more than just 64-bit floats (which is very rare).
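As an illustration of those points, a short h5py sketch (file, dataset, and attribute names are made up for the example) storing a non-float64 multidimensional array with UTF-8 metadata might look like this:

    import h5py
    import numpy as np

    with h5py.File("run_demo.h5", "w") as f:  # hypothetical file name
        # Multidimensional array with a dtype other than 64-bit float.
        dset = f.create_dataset("spectra",
                                data=np.ones((4, 512), dtype=np.float32))
        # UTF-8 metadata attached directly to the dataset.
        dset.attrs["instrument"] = "Fabry-Pérot interferometer"
        dset.attrs["wavelength_unit"] = "µm"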

I can't comment on the known bugs, but I have seen corruption (it happened while I was writing to a file and Linux OOM'd my script). However, this shouldn't be a concern as long as you follow proper data hygiene practices (as mentioned in the hackernews link): in your case, do not continuously write to the same file; instead, create a new file for each run. You should also never modify a file in place; any data reduction step should produce new files, and you should always back up the originals.
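A minimal sketch of the "one new file per run" practice; the save_run helper and the naming scheme are my own illustration, not part of h5py:

    import datetime

    import h5py
    import numpy as np

    def save_run(results, outdir="."):
        """Write one experiment run to its own brand-new file."""
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        fname = f"{outdir}/run-{stamp}.h5"
        # Mode "w-" refuses to open an existing file, so a finished
        # run can never be overwritten or corrupted by a later one.
        with h5py.File(fname, "w-") as f:
            f.create_dataset("results", data=results)
        return fname

    print("wrote", save_run(np.random.rand(100, 100)))

This way a crash can at worst damage the file for the run in progress; every earlier run stays untouched on disk.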

Finally, it is worth pointing out that there are alternatives to HDF5, depending on exactly what your requirements are: SQL databases may fit your needs better (and sqlite3 comes with Python by default, so it's easy to experiment with), as could a simple CSV file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they are no more robust than HDF5, yet more complex than a CSV file.
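For example, experimenting with the standard-library sqlite3 module takes only a few lines (the database file and table names here are made up):

    import sqlite3

    con = sqlite3.connect("results.db")  # hypothetical database file
    con.execute(
        "CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, value REAL)")
    con.executemany("INSERT INTO runs (value) VALUES (?)", [(1.0,), (2.5,)])
    con.commit()  # sqlite journals its writes, so a crash mid-write
                  # leaves the database in a consistent state
    for row in con.execute("SELECT id, value FROM runs"):
        print(row)
    con.close()

Unlike HDF5 1.8, sqlite journals every transaction, which is exactly the property the Wikipedia criticism says HDF5 lacks.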

