The Python Oracle

Reading specific partitions from a partitioned parquet dataset with pyarrow

Full question
https://stackoverflow.com/questions/4800...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #parquet #pyarrow #apachearrow



ANSWER 1

Score 14


As of pyarrow version 0.10.0, you can use the filters keyword argument to do the query. In your case it would look something like this:

import pyarrow.parquet as pq

# Read only the partitions where part2 matches the string 'True'
dataset = pq.ParquetDataset('path-to-your-dataset', filters=[('part2', '=', 'True')])
table = dataset.read()





ACCEPTED ANSWER

Score 6


Question: How do I read specific partitions from a partitioned parquet dataset with pyarrow?

Answer: You can't right now.

Can you create an Apache Arrow JIRA requesting this feature on https://issues.apache.org/jira?

This is something that we should be able to support in the pyarrow API, but it will require someone to implement it. Thank you.