Pandas - retrieving HDF5 columns and memory usage
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzle Game 5 Looping
--
Chapters
00:00 Pandas - Retrieving Hdf5 Columns And Memory Usage
01:20 Accepted Answer Score 5
02:26 Thank you
--
Full question
https://stackoverflow.com/questions/2590...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #pytables #h5py
#avk47
ACCEPTED ANSWER
Score 5
HDFStore in table format is a row-oriented store. A query selects on the row indexes, but for each matching row you get every column. Selecting a subset of columns just does a reindex at the end.
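To make the behaviour concrete, here is a minimal sketch, assuming pandas with PyTables installed. The file name, the key "df" and the column names are hypothetical demo values, not taken from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data: 1000 rows, 4 float columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=list("ABCD"))
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # append() writes table format; "A" becomes a queryable data column.
    store.append("df", df, data_columns=["A"])
    # The where clause picks rows; columns= only narrows the result afterwards.
    subset = store.select("df", where="A > 0", columns=["A", "B"])

print(subset.shape)
```

The where clause reduces the number of rows read, while columns= is best thought of as a convenience for the returned frame rather than a way to cut peak memory.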
There are several ways to approach this:
- use a column store, like bcolz; this is currently not implemented by PyTables, so it would involve quite a bit of work
- chunk through the table, keep what you need from each chunk, and concat at the end; reading this way uses constant memory
- store as a fixed format; this is a more efficient storage format, so it will use less memory (but it cannot be appended to)
- create your own column-store-like layout by storing to multiple sub-tables and reading them back with select_as_multiple
Which option you choose depends on the nature of your data access.
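As an illustration of the chunking option, a rough sketch follows, assuming pandas with PyTables installed. The chunk size, key and column names are hypothetical, and the demo file is created first so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo file: 1000 rows, 4 float columns, table format.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=list("ABCD"))
path = os.path.join(tempfile.mkdtemp(), "chunks.h5")
df.to_hdf(path, key="df", format="table", mode="w")

wanted = ["A", "C"]  # the only columns we want to keep
pieces = []
with pd.HDFStore(path, mode="r") as store:
    # With chunksize, select() returns an iterator of DataFrames.
    for chunk in store.select("df", chunksize=250):
        # Each chunk still holds every column; keep only the narrow slice.
        pieces.append(chunk[wanted])

result = pd.concat(pieces)
print(result.shape)
```

Only one full-width chunk is alive at a time, so peak memory is bounded by the chunk size plus the narrow pieces collected so far.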
Note: you may not want to have all of the columns as data_columns unless you are really going to select from them all (you can only query ON a data_column or an index). Keeping the list short will make storing and querying faster.
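The sub-table approach and the data_columns advice can be combined. The sketch below assumes pandas with PyTables installed; the table names "narrow" and "wide" and the column split are hypothetical choices for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data: 1000 rows, 4 float columns.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=list("ABCD"))
path = os.path.join(tempfile.mkdtemp(), "multi.h5")

with pd.HDFStore(path, mode="w") as store:
    # Split columns across two tables; None means "all remaining columns".
    # Only the selector table's columns become data columns.
    store.append_to_multiple(
        {"narrow": ["A", "B"], "wide": None}, df, selector="narrow"
    )
    # Reading one sub-table touches only that table's columns.
    narrow_only = store.select("narrow", where="A > 0")
    # Query on the selector table, then pull matching rows from both tables.
    full = store.select_as_multiple(
        ["narrow", "wide"], where="A > 0", selector="narrow"
    )

print(narrow_only.shape, full.shape)
```

Queries have to go through the selector table, so the columns you filter on belong there, and the remaining columns can stay in tables that are only read when needed.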