Pandas: replacement with boolean values gives inconsistent results
Full question
https://stackoverflow.com/questions/5078...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #boolean
ACCEPTED ANSWER
Score 3
Pandas stores a series internally as a NumPy array. When a series has mixed types, Pandas / NumPy has to make a decision: it chooses a single dtype that can represent every value in that series. As a trivial example, if you have a series of integers with dtype int and change a single value to a float, your series will become dtype float.
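A minimal sketch of that upcasting. It builds the series at construction time rather than by assigning into an existing one, and the values are made up for illustration:

```python
import pandas as pd

# Hypothetical example values, not taken from the question.
ints = pd.Series([1, 2, 3])
print(ints.dtype)   # an integer dtype (int64 on most platforms)

# One float among the values is enough for the whole series to be stored as float.
mixed = pd.Series([1, 2, 3.5])
print(mixed.dtype)  # float64
print(mixed.tolist())
```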
In this example, your 0th and 2nd columns have NaN values. NaN, or np.nan, is a float (try type(np.nan), this will return float), while True / False are Boolean. The only way to store these values together without converting them is dtype object, which is just a bunch of pointers to Python objects (much like a list).
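As a rough illustration (the column values here are invented, not the ones from the question), you can see that an object series keeps each element as its original Python type:

```python
import numpy as np
import pandas as pd

print(type(np.nan))  # <class 'float'>

# Hypothetical column mixing Booleans with a missing value.
col = pd.Series([True, np.nan, False])
print(col.dtype)  # object

# Each element is still the original Python object, not a converted value.
element_types = [type(v).__name__ for v in col]
print(element_types)  # ['bool', 'float', 'bool']
```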
Your 1st column, on the other hand, has only Boolean values and can be stored with dtype bool. The benefit here is that, since you aren't using a collection of pointers, NumPy can allocate a contiguous memory block for the array. This will yield performance benefits relative to an object series or a list.
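One way to get a feel for the storage difference is to compare the per-element size of the underlying arrays. This is only a sketch: the series length is arbitrary and it measures storage, not speed:

```python
import pandas as pd

# Arbitrary-length Boolean series, plus the same values forced to object dtype.
bool_col = pd.Series([True, False] * 1000)
obj_col = bool_col.astype(object)

# itemsize is the number of bytes each array slot occupies.
bool_itemsize = bool_col.to_numpy().itemsize
obj_itemsize = obj_col.to_numpy().itemsize

print(bool_col.dtype, bool_itemsize)  # bool 1
print(obj_col.dtype, obj_itemsize)    # object, one pointer per slot (8 bytes on 64-bit builds)
```

Each slot of the object array holds only a pointer; the Python objects themselves live elsewhere in memory.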
You can test all the above yourself. Here are some examples:
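import numpy as np
import pandas as pd

# Only Booleans: stored with dtype bool
s1 = pd.Series([True, False])
print(s1.dtype)  # bool
# Booleans plus NaN (a float): falls back to object
s2 = pd.Series([True, False, np.nan])
print(s2.dtype)  # object
# Booleans plus integers: also object
s3 = pd.Series([True, False, 0, 1])
print(s3.dtype)  # object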
The final example is interesting because in Python True == 1 and False == 0 both return True, since bool is a subclass of int. Even so, Pandas does not use this equality to coerce the values to either bool or int; it keeps them as they are in an object series. The corollary of this is that you are advised to check the dtype of your series when dealing with mixed types.
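If you do want one type or the other, a simple option is to convert explicitly with astype. A short sketch, reusing the values from the s3 example above:

```python
import pandas as pd

print(True == 1, False == 0)  # True True
print(issubclass(bool, int))  # True

s3 = pd.Series([True, False, 0, 1])

# Choose the representation yourself instead of relying on inference.
as_bool = s3.astype(bool)
as_int = s3.astype(int)
print(as_bool.tolist())  # [True, False, False, True]
print(as_int.tolist())   # [1, 0, 0, 1]
```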
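Note also that Pandas may re-check dtypes when you update values. The output below is what was observed when this answer was written; newer Pandas releases may leave the dtype as object after the assignment, so verify on your version:
import pandas as pd

s1 = pd.Series([True, 5.4])
print(s1.dtype)  # object
s1.iloc[-1] = False  # replace the float with a Boolean
print(s1.dtype)  # bool as reported in this answer; may remain object on newer versions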
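Since the dtype after an in-place update can depend on the Pandas version, you may prefer to ask for re-inference explicitly. A minimal sketch using Series.infer_objects, which was not part of the original answer:

```python
import pandas as pd

s = pd.Series([True, 5.4])
s.iloc[-1] = False  # the series now holds only Booleans

# Ask pandas to pick the tightest dtype for the current values.
inferred = s.infer_objects()
print(inferred.dtype)  # bool
```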