Pandas read_hdf: how to get column names when using chunksize or iterator?
Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Realization
--
Chapters
00:00 Question
01:28 Accepted answer (Score 2)
02:31 Thank you
--
Full question
https://stackoverflow.com/questions/4800...
Accepted answer links:
[HDFStore.select]: https://pandas.pydata.org/pandas-docs/st...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #numpy #hdf5 #chunking
#avk47
--
ACCEPTED ANSWER
Score 2
Have you tried loading the HDF5 file as an HDFStore? That would give you access to the HDFStore.select method, which may do what you want (seeking, etc.). You can also use select to operate on only a subset of columns (there is a short sketch of that after the code below). To me it looks like it simply offers more flexibility than the read_hdf function. The following might help, as long as you know the structure of your HDF5 file:
import pandas as pd

store = pd.HDFStore('/path/to/file', 'r')

# Reading a single row (stop=1) is enough to recover the column names
colnames = store.select('table_key', stop=1).columns

# Iterate over the table in chunks
chunksize = 100000
chunks = store.select('table_key', chunksize=chunksize)
for chunk in chunks:
    ...code...

# Select one specific chunk as an iterator
chunksize = 100000
start, stop = 300*chunksize, 301*chunksize
this_chunk = store.select('table_key', start=start, stop=stop, iterator=True)
do_work(this_chunk)  # this_chunk is a TableIterator yielding DataFrames

store.close()
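The columns argument mentioned above restricts a select (and each chunk it yields) to the named columns; it only works for stores written in table format. A minimal sketch, assuming hypothetical column names 'col_a' and 'col_b':

# 'col_a' and 'col_b' are placeholder names; substitute your own columns
subset_chunks = store.select('table_key', columns=['col_a', 'col_b'],
                             chunksize=chunksize)
for chunk in subset_chunks:
    print(chunk.shape)  # each chunk contains only the requested columns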
Note that you can also open an HDFStore as a context manager, e.g.,
with pd.HDFStore('/path/to/file', 'r') as store:
    ...code...
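Putting the pieces together, here is a minimal sketch of the original goal (get the column names, then iterate in chunks) using the context-manager form; do_work stands in for whatever per-chunk processing you need:

import pandas as pd

with pd.HDFStore('/path/to/file', 'r') as store:
    # stop=1 reads just one row, enough to recover the column names
    colnames = store.select('table_key', stop=1).columns
    for chunk in store.select('table_key', chunksize=100000):
        do_work(chunk)  # do_work is a placeholder for your own processing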