The Python Oracle

Memory efficient way to store bool and NaN values in pandas

--

Full question
https://stackoverflow.com/questions/5087...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #python3x #pandas #memory #nan

ACCEPTED ANSWER

Score 8


Use dtype: int8

1 = True
0 = False
-1 = NaN

At 1 byte per value, this takes 4 times less memory than float32 and 8 times less than float64.
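
A minimal sketch of that encoding, assuming the data starts out as an object column mixing True, False and NaN (the sample values and the names raw, encoded and decoded are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical input: an object column holding True, False and NaN.
raw = pd.Series([True, False, np.nan, True], dtype=object)

# Encode as in the answer: 1 = True, 0 = False, -1 = NaN.
is_missing = raw.isna()
encoded = pd.Series(
    np.where(is_missing, -1, raw.eq(True)), index=raw.index
).astype("int8")

# One byte per value, excluding the index.
bytes_used = encoded.memory_usage(index=False)

# Decode back to floats with real NaN when missing values are needed.
decoded = encoded.astype("float64").where(encoded != -1)
```

The sentinel -1 is only a convention: pandas does not treat it as missing, so aggregations such as sum or mean need the -1 rows filtered out or decoded first.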




ANSWER 2

Score 3


Building upon the previous answer, it might be worth mentioning that pandas has had an "integer NaN" (pd.NA) since v1.0.0. With one of the nullable integer dtypes such as Int8, a column containing pd.NA keeps its integer dtype. From the linked documentation page:

"In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers."

This might be slightly more readable than encoding NaNs as some known-to-be-invalid integer value, and of course pd.isna returns True for them.

I do not know what effect this has in terms of memory, compared to a simple integer.
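
One way to check the memory question is to compare nbytes directly. The sketch below assumes pandas 1.0 or later and uses made-up sample values. As far as I understand the current implementation, the nullable dtypes store a separate one-byte-per-element mask next to the data, so this is a measurement under that assumption, not a guarantee:

```python
import numpy as np
import pandas as pd

# Nullable extension arrays: missing entries are represented by pd.NA.
nullable_bool = pd.array([True, False, None, True], dtype="boolean")
nullable_int = pd.array([1, 0, None, 1], dtype="Int8")

# Plain int8 with -1 as the sentinel for missing, as in the accepted answer.
plain_int = np.array([1, 0, -1, 1], dtype="int8")

# pd.isna recognises pd.NA in the nullable arrays.
missing_mask = pd.isna(nullable_bool)

# Compare the raw storage of each representation.
print(nullable_bool.nbytes, nullable_int.nbytes, plain_int.nbytes)
```

If the mask assumption holds, the nullable arrays take about twice the memory of the plain int8 array, which is still well below float32 or float64.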