numpy vectorized: check if strings in array end with strings in another array

Full question
https://stackoverflow.com/questions/4449...

Answer 1 links:
[NumPy broadcasting]: https://docs.scipy.org/doc/numpy-1.12.0/...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #arrays #string #numpy

ACCEPTED ANSWER

Score 3

NumPy has this kind of operation for string arrays: numpy.core.defchararray.endswith(), which is also available as np.char.endswith().

The following code is considerably faster than a Python loop, but it uses a lot of memory, because it builds two intermediate string arrays with the same shape as the output array:

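import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# Tile both inputs to the full (len(A), len(B)) output shape
A_matrix = np.repeat(A[:, np.newaxis], len(B), axis=1)
B_matrix = np.repeat(B[:, np.newaxis], len(A), axis=1).transpose()

# result[i, j] is True when A[i] ends with B[j]
result = np.core.defchararray.endswith(A_matrix, B_matrix)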
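To make the memory remark concrete, here is a small sketch that prints the shapes and byte sizes of the intermediate arrays next to the boolean result. It assumes NumPy is imported as np and uses np.char.endswith, the public alias of the function above. The sizes shown are for these tiny inputs only.

```python
import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

A_matrix = np.repeat(A[:, np.newaxis], len(B), axis=1)
B_matrix = np.repeat(B[:, np.newaxis], len(A), axis=1).transpose()
result = np.char.endswith(A_matrix, B_matrix)

# Each operand has the output's shape but stores full strings,
# so it is much larger than the boolean result.
print(A_matrix.shape, B_matrix.shape, result.shape)  # (3, 3) (3, 3) (3, 3)
print(A_matrix.nbytes, B_matrix.nbytes, result.nbytes)  # 144 108 9
```

Both operands grow with len(A) * len(B), which is where the memory cost comes from for large inputs.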

Update:
As Divakar noted, broadcasting makes the explicit repeats unnecessary, so the above code can be consolidated into:

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

np.core.defchararray.endswith(A[:,None], B)
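As a rough check of what the broadcast call returns, the sketch below compares it with a plain Python loop over str.endswith. It assumes NumPy is imported as np and uses the np.char.endswith alias. The name expected is introduced here only for illustration.

```python
import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# A[:, None] has shape (3, 1) and B has shape (3,), so they broadcast to (3, 3):
# entry [i, j] answers "does A[i] end with B[j]?"
result = np.char.endswith(A[:, None], B)
print(result)
# [[ True False  True]
#  [False  True False]
#  [False False False]]

# Reference computed element by element with Python's str.endswith
expected = np.array([[a.endswith(b) for b in B] for a in A])
print(np.array_equal(result, expected))  # True
```

Broadcasting pairs every element of A with every element of B without building the repeated operands by hand.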

ANSWER 2

Score 1

Here's an almost* vectorized approach that makes use of NumPy broadcasting:

import numpy as np

# Get lengths of strings in each array
lens_strings = np.array(list(map(len, strings)))
lens_ends = np.array(list(map(len, ends)))

# Get the rightmost index of each match and add the lengths of the ends strings.
# For a true suffix match, that sum equals the full length of the string.
# rfind returns -1 when there is no match, so mask those entries out; otherwise
# an end exactly one character longer than the string would compare equal.
rfind = np.core.defchararray.rfind
idx = rfind(strings[:,None], ends)
out = (idx >= 0) & (idx + lens_ends == lens_strings[:,None])

Sample run -

In [224]: strings = np.array(['val1', 'val2', 'val3', 'val1y', 'val341'])
     ...: ends = np.array(['1', '2', 'al1', 'l2'])
     ...: 

In [225]: out
Out[225]: 
array([[ True, False,  True, False],
       [False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]], dtype=bool)

*Almost, because of the use of map. Since map is only used to get the lengths of the input strings, its cost should be minimal compared to the other operations needed to solve our case.
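If the map calls are a concern, the lengths can also be computed with np.char.str_len. The sketch below is one possible variant, not the answer's original code: the helper name endswith_via_rfind is hypothetical, and the inputs add one extra string ('a') and one extra end ('bc') to the sample data to exercise the case where rfind finds nothing. The result is cross-checked against np.char.endswith.

```python
import numpy as np

def endswith_via_rfind(strings, ends):
    """Suffix test built from rfind (hypothetical helper for illustration)."""
    lens_strings = np.char.str_len(strings)
    lens_ends = np.char.str_len(ends)
    # Rightmost start index of each end in each string, or -1 if absent
    idx = np.char.rfind(strings[:, None], ends)
    # A suffix match starts at len(string) - len(end); exclude the -1 entries
    return (idx >= 0) & (idx + lens_ends == lens_strings[:, None])

strings = np.array(['val1', 'val2', 'val3', 'val1y', 'val341', 'a'])
ends = np.array(['1', '2', 'al1', 'l2', 'bc'])

out = endswith_via_rfind(strings, ends)
reference = np.char.endswith(strings[:, None], ends)
print(np.array_equal(out, reference))  # True
```

Without the idx >= 0 mask, the pair ('a', 'bc') would be reported as a match, because -1 + 2 equals the length of 'a'.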