The Python Oracle

Pandas read_hdf: how to get column names when using chunksize or iterator?

Full question
https://stackoverflow.com/questions/4800...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #numpy #hdf5 #chunking


ACCEPTED ANSWER

Score 2


Have you tried opening the HDF5 file as an HDFStore? That gives you the HDFStore.select method, which supports start/stop row offsets and chunked iteration, and can also restrict the read to a subset of columns. To me it looks more flexible than the read_hdf function. Note that chunksize, iterator and columns only work when the data was written in table format (format='table'), not the default fixed format. As long as you know the key of the table inside your HDF5 file, the following might help:

import pandas as pd

store = pd.HDFStore('/path/to/file', 'r')

# read a single row just to get the column names
colnames = store.select('table_key', stop=1).columns

# iterate over table chunks
chunksize = 100000
chunks = store.select('table_key', chunksize=chunksize)
for chunk in chunks:
    ...  # process each chunk (a DataFrame) here

# select 1 specific chunk as iterator
chunksize = 100000
start, stop = 300*chunksize, 301*chunksize
this_chunk = store.select('table_key', start=start, stop=stop, iterator=True)
do_work(this_chunk)  # do_work is a placeholder for your own function

store.close()

Note that you can also open an HDFStore as a context manager, which closes the file for you even if an exception is raised, e.g.,

with pd.HDFStore('/path/to/file', 'r') as store:
    ...  # work with store here
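
To try the pieces above end to end, here is a minimal, self-contained sketch. It is only an illustration under a few assumptions: the file name demo.h5, the key table_key, the tiny three-column DataFrame and the chunk size of 4 are all made up for the demo, and it needs the PyTables package (tables) installed alongside pandas. It writes a small table-format file to a temporary directory, reads one row to get the column names, then streams two of the columns in chunks.

```python
import os
import tempfile

import pandas as pd

# Build a small throwaway table so the sketch runs on its own.
# The column names, 'demo.h5' and 'table_key' are illustrative only.
df = pd.DataFrame(
    {
        "a": range(10),
        "b": [x * 0.5 for x in range(10)],
        "c": list("abcdefghij"),
    }
)

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.h5")

# format="table" matters: chunksize/iterator and columns= are not
# supported on the default "fixed" format.
df.to_hdf(path, key="table_key", format="table")

with pd.HDFStore(path, mode="r") as store:
    # Read a single row to learn the column names cheaply.
    colnames = list(store.select("table_key", stop=1).columns)

    # Stream the table in chunks, restricted to two of the columns.
    row_counts = []
    chunk_columns = []
    for chunk in store.select("table_key", columns=["a", "b"], chunksize=4):
        row_counts.append(len(chunk))
        chunk_columns.append(list(chunk.columns))

print(colnames)
print(row_counts)
```

Reading with stop=1 still goes through the normal read path, so it returns a one-row DataFrame; for a real file that is usually cheap compared with loading everything just to look at the header.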