The Python Oracle

How to create dummy variables in Pandas when columns can have mixed types?

--

Full question
https://stackoverflow.com/questions/3637...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas



ACCEPTED ANSWER

Score 3


Here are two ways you could do this:

In [37]: pd.to_numeric(df.A, errors='coerce').notnull() & (df.A > 0)
Out[37]:
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool

In [38]: df.A.apply(np.isreal) & (df.A > 0)
Out[38]:
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool

A third option, which may be slower:

In [39]: df.A.str.isnumeric().isnull() & (df.A > 0)
Out[39]:
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool
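
The masks above compare the raw column with `df.A > 0`, which only works where str and int can be compared (Python 2). A minimal sketch of a variant that avoids that comparison is to coerce once and compare the coerced values. The sample DataFrame here is an assumption reconstructed from the printed output, not the asker's actual data:

```python
import numpy as np
import pandas as pd

# Assumed sample data: the original DataFrame is not shown in full,
# so this mixed-type column is a guess based on the printed output.
df = pd.DataFrame({"A": [1, 2, -1, np.nan, "rh"]})

# Coerce once: non-numeric entries such as "rh" become NaN.
numeric = pd.to_numeric(df["A"], errors="coerce")

# Compare the coerced values, not the raw column, so no str/int
# comparison ever happens. NaN > 0 evaluates to False.
mask = numeric > 0

# Convert the boolean mask into a 0/1 dummy column.
df["dummy1"] = mask.astype(int)
print(df)
```

Because the comparison runs on a float column, the separate `notnull()` check is not needed for a condition like `> 0`; it may still be needed for conditions that NaN should not be allowed to satisfy, such as `!=`.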



ANSWER 2

Score 2


Update: @JohnGalt pointed out in the comments a better way would be to use pd.to_numeric with errors='coerce':

# Put your own condition here in place of `> 0`; this relies on NaN > 0 being False
[18]: df['dummy1'] = (pd.to_numeric(df.A, errors='coerce') > 0).astype('int')
[19]: df
Out[19]:
     A  dummy1
0    1       1
1    2       1
2   -1       0
3  NaN       0
4   rh       0

One general way to create such dummy variables is along these lines:

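def foo(a):
    try:
        tmp = int(a)  # raises for values such as NaN or 'rh'
        return 1 if tmp > 0 else 0  # Your condition here.
    except (TypeError, ValueError):
        # Anything that cannot be converted to an int gets 0.
        return 0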

[12]: df.A.map(foo)
Out[12]:
0    1
1    1
2    0
3    0
4    0
Name: A, dtype: int64
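
The same idea can be sketched with the condition passed in as a parameter, so the function does not need editing for each new rule. The names `to_dummy` and `condition` are hypothetical, and this version converts with `float` rather than `int`, which is a design difference from `foo` (values like "1.5" count as numeric here):

```python
import math


def to_dummy(value, condition=lambda x: x > 0):
    """Return 1 if value is numeric and satisfies condition, else 0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0  # not convertible, e.g. "rh" or None
    if math.isnan(number):
        return 0  # treat missing values as failing the condition
    return 1 if condition(number) else 0


# Assumed sample values, mirroring the column shown in the answer.
values = [1, 2, -1, float("nan"), "rh"]
dummies = [to_dummy(v) for v in values]
negatives = [to_dummy(v, condition=lambda x: x < 0) for v in values]
print(dummies, negatives)
```

With a pandas column, the function could be applied as `df.A.map(to_dummy)`, in the same way as `foo` above.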

You are doing the operations in Python 2.7, where comparisons between str and int are (unfortunately) allowed. The operations fail on Python 3:

[5]: df.A > 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-890e73655a37> in <module>()
----> 1 df.A > 0

/home/utkarshu/miniconda3/envs/py35/lib/python3.5/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
    724                 other = np.asarray(other)
    725
--> 726             res = na_op(values, other)
    727             if isscalar(res):
    728                 raise TypeError('Could not compare %s type with Series'

/home/utkarshu/miniconda3/envs/py35/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
    646                     result = lib.vec_compare(x, y, op)
    647             else:
--> 648                 result = lib.scalar_compare(x, y, op)
    649         else:
    650

pandas/lib.pyx in pandas.lib.scalar_compare (pandas/lib.c:14186)()

TypeError: unorderable types: str() > int()
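
The root cause can be reproduced without pandas. The following is a small sketch, assuming Python 3, that shows the failing comparison and one way to guard an element-wise check against it. The helper name `safe_positive` is made up for illustration:

```python
# In Python 3, ordering comparisons between str and int raise TypeError.
try:
    "rh" > 0
    mixed_comparison_allowed = True
except TypeError:
    mixed_comparison_allowed = False


def safe_positive(value):
    """Return True only if value can be compared with 0 and is greater."""
    try:
        return bool(value > 0)
    except TypeError:
        return False  # e.g. a str, which cannot be ordered against an int


# Assumed sample values, mirroring the column shown in the answer.
values = [1, 2, -1, float("nan"), "rh"]
flags = [safe_positive(v) for v in values]
print(mixed_comparison_allowed, flags)
```

This per-element guard is slower than coercing the whole column with `pd.to_numeric`, so it is probably best kept for small inputs or for conditions that are hard to vectorize.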