pandas create new column based on values from other columns / apply a function of multiple columns, row-wise
pandas create new column based on values from other columns / apply a function of multiple columns, row-wise
--
Become part of the top 3% of the developers by applying to Toptal
https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Over a Mysterious Island
--
Chapters
00:00 Question
02:20 Accepted answer (Score 707)
04:01 Answer 2 (Score 399)
05:00 Answer 3 (Score 132)
06:51 Answer 4 (Score 42)
07:12 Thank you
--
Full question
https://stackoverflow.com/questions/2688...
Answer 1 links:
[https://stackoverflow.com/a/12555510/243...]: https://stackoverflow.com/a/12555510/243...
[apply]: https://pandas.pydata.org/pandas-docs/st...
[assign]: https://pandas.pydata.org/pandas-docs/st...
Answer 2 links:
Answer 3 links:
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #dataframe #numpy #apply
ACCEPTED ANSWER
Score 778
OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:
def label_race(row):
if row['eri_hispanic'] == 1:
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
return 'Two Or More'
if row['eri_nat_amer'] == 1:
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".
Next, use the apply function in pandas to apply the function - e.g.
df.apply(label_race, axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.
df['race_label'] = df.apply(label_race, axis=1)
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
ANSWER 2
Score 442
Since this is the first Google result for 'pandas new column from others', here's a simple example:
import pandas as pd
# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
# a b
# 0 1 3
# 1 2 4
# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0 4
# 1 6
# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
# a b c
# 0 1 3 4
# 1 2 4 6
If you get the SettingWithCopyWarning you can do it this way also:
col = df.apply(lambda row: row.a + row.b, axis=1)
df = df.assign(c=col.values) # assign values to column 'c'
Source: https://stackoverflow.com/a/12555510/243392
And if your column name includes spaces you can use syntax like this:
df = df.assign(**{'some column name': col.values})
ANSWER 3
Score 143
The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:
First, define conditions:
conditions = [
df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1,
]
Now, define the corresponding outputs:
outputs = [
'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]
Finally, using numpy.select:
res = np.select(conditions, outputs, 'Other')
pd.Series(res)
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
Why should numpy.select be used over apply? Here are some performance checks:
df = pd.concat([df]*1000)
In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: conditions = [
...: df['eri_hispanic'] == 1,
...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
...: df['eri_nat_amer'] == 1,
...: df['eri_asian'] == 1,
...: df['eri_afr_amer'] == 1,
...: df['eri_hawaiian'] == 1,
...: df['eri_white'] == 1,
...: ]
...:
...: outputs = [
...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
...: ]
...:
...: np.select(conditions, outputs, 'Other')
...:
...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.
ANSWER 4
Score 43
.apply() takes in a function as the first parameter; pass in the label_race function as so:
df['race_label'] = df.apply(label_race, axis=1)
You don't need to make a lambda function to pass in a function.