Comparing the speed of startswith() vs. in
--
Full question
https://stackoverflow.com/questions/4452...
Accepted answer links:
[cmp_outcome]: https://github.com/python/cpython/blob/m...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
ACCEPTED ANSWER
Score 13
startswith is slower because it requires an attribute lookup followed by a method call. The in operator is specialized and compiles directly to COMPARE_OP (which calls cmp_outcome, which in turn calls PySequence_Contains), while str.startswith goes through slower bytecode:
    2 LOAD_ATTR                0 (startswith)
    4 LOAD_FAST                1 (word)
    6 CALL_FUNCTION            1    # the slow part
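To check this on your own interpreter, a sketch along the following lines lists the opcodes generated for both variants. The two functions are assumed stand-ins for the ones in the question, which is not reproduced here. Exact opcode names differ between CPython versions (newer releases, for example, emit CONTAINS_OP instead of COMPARE_OP for in), so the helpers only look for an attribute reference and a call-type opcode rather than specific names.

```python
import dis

# Assumed stand-ins for the functions in the question.
def in_test(sent, word):
    return word in sent

def startswith_test(sent, word):
    return sent.startswith(word)

def opnames(func):
    """Return the list of opcode names CPython generated for func."""
    return [ins.opname for ins in dis.get_instructions(func)]

def references_name(func, name):
    """True if any instruction in func refers to the given name."""
    return any(ins.argval == name for ins in dis.get_instructions(func))

print(opnames(in_test))
print(opnames(startswith_test))
```

On the versions I would expect this to run on, only startswith_test shows an attribute load and a call opcode; in_test has neither.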
Replacing in with __contains__, forcing a function call for that case too, pretty much negates the speed difference:
setup1='''
def in_test(sent, word):
    if sent.__contains__(word):
        return True
    else:
        return False
'''
And, the timings:
print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))
0.43849368393421173
0.4993997460696846
in wins in the original benchmark because it skips the function call setup entirely, and because the test case is a favorable one for it: the match is found right at the start of the string.
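For reference, a self-contained version of this comparison might look like the sketch below. setup1 follows the snippet in this answer; setup2 is not shown anywhere in the answer, so the startswith_test definition here is an assumption about what the question used. number is lowered from timeit's default of 1,000,000 to keep the run short, so absolute timings will not match the figures quoted above.

```python
import timeit

# setup1 follows the answer: `in` replaced by an explicit __contains__ call.
setup1 = '''
def in_test(sent, word):
    if sent.__contains__(word):
        return True
    else:
        return False
'''

# setup2 is an assumed counterpart; the original definition is not shown.
setup2 = '''
def startswith_test(sent, word):
    if sent.startswith(word):
        return True
    else:
        return False
'''

stmt_in = 'in_test("this is a standard sentence", "this")'
stmt_sw = 'startswith_test("this is a standard sentence", "this")'

# Fewer iterations than the default, purely to shorten the run.
t_in = timeit.timeit(stmt_in, setup=setup1, number=100_000)
t_sw = timeit.timeit(stmt_sw, setup=setup2, number=100_000)
print(f"__contains__: {t_in:.4f}s  startswith: {t_sw:.4f}s")
```

Results vary by machine and Python version, so treat any single run as indicative only.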
ANSWER 2
Score 6
You're comparing an operator on strings with an attribute lookup plus a function call. The second has a higher fixed overhead, even if the first takes a long time on a lot of data.
Additionally, you're looking for the first word, so when it matches, in looks at just as much data as startswith(). To see the difference, look at a pessimistic case (no match found, or a match at the end of the string):
setup1='''
data = "xxxx"*1000
def ....
print(timeit.timeit('in_test(data, "this")', setup=setup1))
0.932795189000899
print(timeit.timeit('startswith_test(data, "this")', setup=setup2))
0.22242475600069156
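Since the function definitions are elided in the snippet above, here is a runnable sketch of the same pessimistic case. The bodies of in_test and startswith_test are hypothetical, chosen only to make the example complete, and number is reduced from the default to keep the run short.

```python
import timeit

# Hypothetical setup: the answer elides the function definitions.
setup = '''
data = "xxxx" * 1000

def in_test(sent, word):
    return word in sent

def startswith_test(sent, word):
    return sent.startswith(word)
'''

# "this" never occurs in data, so `in` has to scan the whole string,
# while startswith only inspects the first few characters.
t_in = timeit.timeit('in_test(data, "this")', setup=setup, number=100_000)
t_sw = timeit.timeit('startswith_test(data, "this")', setup=setup, number=100_000)
print(f"in: {t_in:.4f}s  startswith: {t_sw:.4f}s")
```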
ANSWER 3
Score 0
If you look at bytecode produced by your functions:
>>> dis.dis(in_test)
  2           0 LOAD_FAST                1 (word)
              3 LOAD_FAST                0 (sent)
              6 COMPARE_OP               6 (in)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               1 (True)
             15 RETURN_VALUE

  5     >>   16 LOAD_CONST               2 (False)
             19 RETURN_VALUE
             20 LOAD_CONST               0 (None)
             23 RETURN_VALUE
you'll notice a fair amount of overhead that is not directly related to string matching. Running the test on a simpler function:
def in_test(sent, word):
    return word in sent
gives a more reliable measurement.
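As a rough check of that claim, the sketch below counts the bytecode instructions in both shapes of the function. The branching version is inferred from the disassembly above (source lines 2, 3 and 5), not copied from the question, and both function names are made up for this example. Exact counts depend on the CPython version, so none are quoted here.

```python
import dis

# Inferred from the disassembly: an if/else that returns True or False.
def in_test_branching(sent, word):
    if word in sent:
        return True
    else:
        return False

# The simpler form suggested in this answer.
def in_test_simple(sent, word):
    return word in sent

def instruction_count(func):
    """Number of bytecode instructions CPython generated for func."""
    return sum(1 for _ in dis.get_instructions(func))

print(instruction_count(in_test_branching), instruction_count(in_test_simple))
```

Both functions return the same result for any pair of strings; the simple one just does less work around the comparison.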