The Python Oracle

numpy vectorized: check if strings in array end with strings in another array




Full question
https://stackoverflow.com/questions/4449...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #arrays #string #numpy




ACCEPTED ANSWER

Score 3


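NumPy has this kind of operation for string arrays: numpy.core.defchararray.endswith(). The same function is also available under the shorter public name numpy.char.endswith().

The following code speeds things up considerably, but it uses a lot of memory, because it builds two intermediate arrays, each with the same shape as the output array (len(A) by len(B)):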

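import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# Repeat A along columns and B along rows so both have shape (len(A), len(B))
A_matrix = np.repeat(A[:, np.newaxis], len(B), axis=1)
B_matrix = np.repeat(B[:, np.newaxis], len(A), axis=1).transpose()

# np.char is the public alias of np.core.defchararray
# result[i, j] is True when A[i] ends with B[j]
result = np.char.endswith(A_matrix, B_matrix)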
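
As a sanity check, the repeated-matrix version can be compared against a plain nested loop. This is only a sketch: the `expected` array is a hypothetical reference built for illustration, and `np.char.endswith` is used as the public alias of the function named above.

```python
import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# Pure-Python reference: expected[i, j] is True when A[i] ends with B[j]
expected = np.array([[a.endswith(b) for b in B] for a in A])

# Explicitly repeated matrices, both of shape (len(A), len(B))
A_matrix = np.repeat(A[:, np.newaxis], len(B), axis=1)
B_matrix = np.repeat(B[:, np.newaxis], len(A), axis=1).transpose()
result = np.char.endswith(A_matrix, B_matrix)

print(result.shape)                      # (3, 3)
print(np.array_equal(result, expected))  # True
```

Both intermediate matrices hold len(A) * len(B) strings, which is where the extra memory goes.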

Update:
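As Divakar noted, the above code can be consolidated by letting NumPy broadcasting pair every element of A with every element of B, without building the repeated arrays explicitly:

import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# A[:, None] has shape (len(A), 1); it broadcasts against B, shape (len(B),)
# np.char is the public alias of np.core.defchararray
np.char.endswith(A[:, None], B)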
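
To make the broadcasting step easier to see, here is a small sketch with arrays of different lengths. The extra string 'val341' and the name `ends_with_any` are illustrative assumptions, not part of the original answer.

```python
import numpy as np

A = np.array(['val1', 'val2', 'val3', 'val341'])  # 4 strings
B = np.array(['1', '2', 'al1'])                    # 3 endings

# A[:, None] has shape (4, 1) and B has shape (3,), so broadcasting
# pairs every string with every ending, giving shape (4, 3).
print(np.broadcast(A[:, None], B).shape)  # (4, 3)

result = np.char.endswith(A[:, None], B)

# Rows index A and columns index B. Reducing over the columns answers
# "does this string end with any of the endings?"
ends_with_any = result.any(axis=1)
print(ends_with_any)  # [ True  True False  True]
```

If only the per-string answer is needed, reducing with any(axis=1) keeps the final output at length len(A), although the full boolean matrix is still computed first.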



ANSWER 2

Score 1


Here's an almost* vectorized approach making use of NumPy broadcasting -

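import numpy as np

# Get lengths of strings in each array
lens_strings = np.array(list(map(len, strings)))
lens_ends = np.array(list(map(len, ends)))

# Get the rightmost index of each match and add the lengths of the ends strings.
# For a true suffix match, that sum equals the full length of the string,
# so do a final comparison against those lengths.
# rfind returns -1 when there is no match, so require idx >= 0 as well;
# otherwise an end that is one character longer than the string gives a false positive.
rfind = np.char.rfind  # public alias of np.core.defchararray.rfind
idx = rfind(strings[:, None], ends)
out = (idx >= 0) & (idx + lens_ends == lens_strings[:, None])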

Sample run -

In [224]: strings = np.array(['val1', 'val2', 'val3', 'val1y', 'val341'])
     ...: ends = np.array(['1', '2', 'al1', 'l2'])
     ...: 

In [225]: out
Out[225]: 
array([[ True, False,  True, False],
       [False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]], dtype=bool)

*Almost, because of the use of map. Since map is only used to get the lengths of the input strings, its cost should be small compared with the other operations needed to solve the problem.
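
If the map calls are a concern, one possible variation (an assumption on my part, not part of the original answer) is to get the lengths with np.char.str_len. The sketch below also checks explicitly that rfind found a match, since rfind returns -1 when there is none and -1 + len(end) can equal len(string) when the end is one character longer than the string. The sample arrays here are hypothetical and chosen to hit that edge case.

```python
import numpy as np

strings = np.array(['val1', 'val2', 'a'])
ends = np.array(['1', 'l2', 'bb'])

# Vectorized string lengths, replacing map(len, ...)
lens_strings = np.char.str_len(strings)
lens_ends = np.char.str_len(ends)

# Rightmost match position for every (string, end) pair; -1 means no match
idx = np.char.rfind(strings[:, None], ends)

# A suffix match must exist (idx >= 0) and must run to the end of the string
out = (idx >= 0) & (idx + lens_ends == lens_strings[:, None])

# Cross-check against the direct endswith call
print(np.array_equal(out, np.char.endswith(strings[:, None], ends)))  # True
```

Without the idx >= 0 check, the pair ('a', 'bb') would be reported as a match, because -1 + 2 == 1.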