The Python Oracle

Pandas - retrieving HDF5 columns and memory usage

Full question
https://stackoverflow.com/questions/2590...

Question links:
https://github.com/pydata/pandas/issues/...

Accepted answer links:
[bcolz]: https://github.com/Blosc/bcolz
[here]: http://pandas.pydata.org/pandas-docs/dev...
[here]: http://pandas.pydata.org/pandas-docs/dev...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #pytables #h5py

ACCEPTED ANSWER

Score 5


HDFStore in table format is a row-oriented store. When you select, the query indexes on the rows, but for each matching row you get every column. Selecting a subset of columns only does a reindex at the end, after the full rows have been read.
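A minimal sketch of that behaviour, assuming a recent pandas with PyTables installed. The frame, its column names, and the file path are made up for illustration. The columns argument trims the result, but the rows are still read whole from the table first.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example data: 1,000 rows and 5 float columns.
df = pd.DataFrame(np.random.randn(1000, 5), columns=list("abcde"))

# Temporary file path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # Table format is the row-oriented layout described above.
    store.put("df", df, format="table")

    # The column subset is applied to the rows after they are read.
    subset = store.select("df", columns=["a", "b"])
```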

There are several ways to approach this:

  • use a column store, like bcolz; this is currently not implemented by PyTables, so it would involve quite a bit of work
  • chunk through the table (see here), keep only the columns you need from each chunk, and concat at the end - the read itself then uses constant memory per chunk
  • store in fixed format - this is a more efficient storage format, so it will use less memory (but it cannot be appended to)
  • create your own column-store-like layout by storing to multiple sub-tables and reading with select_as_multiple (see here)
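The chunking option can be sketched roughly as follows, assuming pandas with PyTables installed. The data, the chunk size of 2,500 rows, and the file path are arbitrary choices for illustration. Only one chunk of full-width rows is held at a time, and just the wanted columns are kept from each.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example data: 10,000 rows and 5 float columns.
df = pd.DataFrame(np.random.randn(10_000, 5), columns=list("abcde"))
path = os.path.join(tempfile.mkdtemp(), "chunks.h5")

with pd.HDFStore(path, mode="w") as store:
    store.put("df", df, format="table")

    pieces = []
    # Passing chunksize makes select() return an iterator of DataFrames.
    for chunk in store.select("df", chunksize=2_500):
        # Keep only the columns of interest; the wide chunk is then discarded.
        pieces.append(chunk[["a", "b"]])

# Combine the narrow pieces into the final result.
result = pd.concat(pieces)
```

The final frame still has to fit in memory, so the saving comes from never holding all columns for all rows at once.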

Which option you choose depends on the nature of your data access.
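As one illustration of the sub-table option, here is a sketch using append_to_multiple and select_as_multiple, assuming pandas with PyTables installed. The table names "left" and "right", the column split, and the query are hypothetical. Columns that are usually read together go in the same sub-table, so a narrow read touches only that table.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example data: 1,000 rows and 5 float columns.
df = pd.DataFrame(np.random.randn(1000, 5), columns=list("abcde"))
path = os.path.join(tempfile.mkdtemp(), "multi.h5")

with pd.HDFStore(path, mode="w") as store:
    # Split the frame: "left" gets columns a and b, "right" gets the rest (None).
    # The selector table's columns become queryable data columns by default.
    store.append_to_multiple(
        {"left": ["a", "b"], "right": None},
        df,
        selector="left",
    )

    # Narrow read: only the "left" sub-table is touched.
    only_left = store.select("left", where="a > 0")

    # Wide read: rows are chosen via the selector table, then joined across tables.
    wide = store.select_as_multiple(
        ["left", "right"], where="a > 0", selector="left"
    )
```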

Note: you may not want to make all of the columns data_columns unless you are really going to select on all of them (you can only query on a data_column or an index). Declaring fewer data_columns will make storing and querying faster.
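A short sketch of that advice, assuming pandas with PyTables installed. The frame and the choice of "a" as the only queryable column are hypothetical. Only the column used in where clauses is declared as a data column.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example data: 1,000 rows and 5 float columns.
df = pd.DataFrame(np.random.randn(1000, 5), columns=list("abcde"))
path = os.path.join(tempfile.mkdtemp(), "datacols.h5")

with pd.HDFStore(path, mode="w") as store:
    # Declare only the column that will appear in queries as a data column.
    store.append("df", df, data_columns=["a"])

    # "a" is a data column, so it can be used in a where clause.
    hits = store.select("df", where="a > 0")
```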