The Python Oracle

Pandas Return Separate DataFrame Values Based on Function

Full question
https://stackoverflow.com/questions/5948...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe #distance



ACCEPTED ANSWER

Score 3


You can use the pd.cut function to define the intervals that contain each latitude, then merge the two dataframes to obtain the result:

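import pandas as pd

# Build a (Lat - 1, Lat + 1) pair of bin edges for each latitude in df1, then flatten.
# Note: pd.cut requires the flattened edges to increase monotonically.
bins = [(i - 1, i + 1) for i in df1['Lat']]
bins = [item for subbins in bins for item in subbins]

# Label each row of both dataframes with the interval its latitude falls into.
df1['Interval'] = pd.cut(df1['Lat'], bins=bins)
df2['Interval'] = pd.cut(df2['Station_Lat'], bins=bins)

# With no 'on' argument, merge joins on the columns both dataframes share.
pd.merge(df1, df2)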

This solution is slightly faster than yours: 10.2 ms ± 201 µs per loop versus 12.2 ms ± 1.34 ms per loop.
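To make the pd.cut approach concrete, here is a minimal runnable sketch. The sample latitudes and station names are hypothetical (the question's real data is not shown here), and df1's latitudes are assumed to be sorted and more than 2 degrees apart so that the bin edges increase monotonically.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
df1 = pd.DataFrame({"Lat": [10.0, 20.0, 30.0, 40.0]})
df2 = pd.DataFrame({
    "Station": ["ABC", "DEF", "GHI", "JKL"],
    "Station_Lat": [10.5, 25.0, 35.0, 39.5],
})

# Flattened bin edges: [9, 11, 19, 21, 29, 31, 39, 41]
bins = [edge for lat in df1["Lat"] for edge in (lat - 1, lat + 1)]

# Intervals are right-closed by default, e.g. (9.0, 11.0].
df1["Interval"] = pd.cut(df1["Lat"], bins=bins)
df2["Interval"] = pd.cut(df2["Station_Lat"], bins=bins)

# 'Interval' is the only shared column, so it becomes the join key.
result = pd.merge(df1, df2)
print(result[["Lat", "Station", "Station_Lat"]])
```

Stations that fall into a gap between two latitude windows, such as (21.0, 29.0], get an interval that no df1 row shares, so the inner merge drops them.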




ANSWER 2

Score 1


This may be faster:

df2 = df2.sort_values("Station_Lat")

After sorting, you can use "searchsorted":

df1["idx"] = df2.Station_Lat.searchsorted(df1.Lat)

"idx" is the index of the 'nearest' station latitude, or else idx+1 is. You may need to duplicate the last row of df2 (see the "searchsorted" documentation) to avoid indexing past its end. Then use "apply" with this custom function:

def dist(row):
    if abs(row.Lat - df2.loc[row.idx].Station_Lat) <= 1:
        return df2.loc[row.idx].Station
    elif abs(row.Lat - df2.loc[row.idx + 1].Station_Lat) <= 1:
        return df2.loc[row.idx + 1].Station

    return False

df1.apply(dist, axis=1)

0      ABC
1    False
2    False
3      JKL
dtype: object

Edit: Because 'dist()' assumes that df2.index is ordered and monotonically increasing (see: row.idx+1), the first line of code must be corrected:

df2 = df2.sort_values("Station_Lat").reset_index(drop=True)

'dist()' is also somewhat faster when written this way (but it still doesn't beat the Cartesian product method):

def dist(row):
    idx = row.idx
    lat1, lat2 = df2.loc[idx:idx + 1, "Station_Lat"]
    if abs(row.Lat - lat1) <= 1:
        return df2.loc[idx, "Station"]
    elif abs(row.Lat - lat2) <= 1:
        return df2.loc[idx + 1, "Station"]
    return False
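As a hedged cross-check of the searchsorted idea: with the default side="left", searchsorted returns the insertion position, so the stations bracketing a latitude sit at positions pos-1 and pos in the sorted array. The sketch below checks those two neighbours and guards both ends of the array, which avoids having to duplicate a row. The sample data, the nearest_station helper and its tol parameter are all hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, deliberately unsorted.
df1 = pd.DataFrame({"Lat": [10.0, 20.0, 30.0, 40.0]})
df2 = pd.DataFrame({
    "Station": ["JKL", "ABC", "GHI", "DEF"],
    "Station_Lat": [39.5, 10.5, 35.0, 25.0],
})

# Sort once and work on plain arrays with positions 0..n-1.
df2_sorted = df2.sort_values("Station_Lat").reset_index(drop=True)
lats = df2_sorted["Station_Lat"].to_numpy()
stations = df2_sorted["Station"].to_numpy()


def nearest_station(lat, tol=1.0):
    """Return the closest station within tol degrees, else False.

    Assumes df2 is not empty.
    """
    pos = np.searchsorted(lats, lat)  # insertion position, side="left"
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [p for p in (pos - 1, pos) if 0 <= p < len(lats)]
    best = min(candidates, key=lambda p: abs(lats[p] - lat))
    return stations[best] if abs(lats[best] - lat) <= tol else False


nearest = df1["Lat"].apply(nearest_station)
print(nearest)
```

Returning False for "no match" mirrors the original dist() function; None or NaN may be easier to filter on afterwards.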



ANSWER 3

Score 0


How about a lambda?

df3[df3.apply(lambda x, col1='Lat', col2='Station_Lat': x[col1]-x[col2] >= -1 and x[col1]-x[col2] <= 1, axis=1)]['Station']

Output:

0     ABC
15    JKL

Edit: Here's a second solution. (Note: This also uses abs() since >=-1 and <= 1 seems redundant.)

for i in df1.index:
    for j in df2.index:
        if abs(df1.loc[i, 'Lat'] - df2.loc[j, 'Station_Lat']) <=1:
            print(df2.loc[j, 'Station'])

Or, in list comprehension form:

df2.loc[[j for i in df1.index for j in df2.index if abs(df1.loc[i, 'Lat'] - df2.loc[j, 'Station_Lat']) <= 1], 'Station']

Output:

ABC
JKL
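The nested loops above compare every row of df1 with every row of df2. The same pairwise comparison can be sketched without Python-level loops by building the Cartesian product with a cross merge and filtering on the absolute difference. This assumes pandas 1.2 or later (for how="cross"); the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
df1 = pd.DataFrame({"Lat": [10.0, 20.0, 30.0, 40.0]})
df2 = pd.DataFrame({
    "Station": ["ABC", "DEF", "GHI", "JKL"],
    "Station_Lat": [10.5, 25.0, 35.0, 39.5],
})

# Every (df1 row, df2 row) pair: 4 x 4 = 16 rows.
pairs = df1.merge(df2, how="cross")

# Keep the pairs whose latitudes differ by at most 1 degree.
close = pairs[(pairs["Lat"] - pairs["Station_Lat"]).abs() <= 1]
print(close["Station"])
```

The cross merge materializes len(df1) * len(df2) rows, so it trades memory for speed and may not suit very large inputs.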