The Python Oracle

Filter out everything before a condition is met, keep all elements after

--


Full question
https://stackoverflow.com/questions/7096...

Accepted answer links:
[@Kelly Bundy]: https://stackoverflow.com/a/71093342/175...
[image]: https://i.stack.imgur.com/x2VfZ.png
[@richardec]: https://stackoverflow.com/a/70965179/175...
[@0x263A]: https://stackoverflow.com/a/70965509/175...
[image]: https://i.stack.imgur.com/w5tyK.png

Answer 2 links:
[itertools.dropwhile]: https://docs.python.org/3/library/iterto...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #list #listcomprehension




ANSWER 1

Score 50


You can use itertools.dropwhile:

from itertools import dropwhile

p = [4,9,10,4,20,13,29,3,39]

p = dropwhile(lambda x: x <= 18, p)
print(*p) # 20 13 29 3 39

In my opinion, this is the easiest-to-read version. It also corresponds to a common pattern in functional programming languages, such as dropWhile (<=18) p in Haskell and p.dropWhile(_ <= 18) in Scala.
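One detail worth spelling out: dropwhile returns a lazy, single-use iterator, not a list. The sketch below is a minimal illustration of that, assuming the same sample data; the names tail and result are mine and not part of the original answer.

```python
from itertools import dropwhile

p = [4, 9, 10, 4, 20, 13, 29, 3, 39]

# dropwhile returns a lazy, single-use iterator rather than a list.
tail = dropwhile(lambda x: x <= 18, p)

# Materialize it once if the result is needed more than once.
result = list(tail)
print(result)      # [20, 13, 29, 3, 39]

# The iterator is now exhausted, so a second pass yields nothing.
print(list(tail))  # []

# The original list is untouched because the result went to a new name.
print(p)           # [4, 9, 10, 4, 20, 13, 29, 3, 39]
```

If the filtered values are only iterated once, the list() call can be skipped.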


Alternatively, using the walrus operator (available in Python 3.8+):

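p = [4, 9, 10, 4, 20, 13, 29, 3, 39]  # start again from the original list
exceeded = False  # becomes True at the first x > 18 and stays True
p = [x for x in p if (exceeded := exceeded or x > 18)]
print(p) # [20, 13, 29, 3, 39]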
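To make the flag logic easier to follow, here is a small sketch that wraps the same comprehension in a function. The function name and the threshold parameter are hypothetical additions for illustration only.

```python
def keep_from_first_above(values, threshold):
    """Return the elements from the first value > threshold onward."""
    exceeded = False  # flips to True at the first match and never flips back
    # `exceeded or x > threshold` short-circuits once exceeded is True,
    # so every later element is kept without being compared again.
    return [x for x in values if (exceeded := exceeded or x > threshold)]


print(keep_from_first_above([4, 9, 10, 4, 20, 13, 29, 3, 39], 18))  # [20, 13, 29, 3, 39]
print(keep_from_first_above([1, 2, 3], 18))                         # []
```

Keeping the flag inside a function also stops it from lingering as a module-level variable.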

But my guess is that some people don't like this style. In that case, one can do an explicit for loop (ilkkachu's suggestion):

for i, x in enumerate(p):
    if x > 18:
        output = p[i:]
        break
else:
    output = [] # no element exceeded 18; alternatively, set output = [] before the loop



ACCEPTED ANSWER

Score 23


You could use enumerate and list slicing in a generator expression and next:

out = next((p[i:] for i, item in enumerate(p) if item > 18), [])

Output:

[20, 13, 29, 3, 39]
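The second argument to next matters when nothing exceeds the threshold. A minimal sketch of both cases follows; the list q is a made-up input used only to show the no-match behaviour.

```python
p = [4, 9, 10, 4, 20, 13, 29, 3, 39]
q = [1, 2, 3]  # hypothetical input with no element above 18

# The generator yields a slice at the first matching index; next() takes
# that first slice and never advances the generator again.
found = next((p[i:] for i, item in enumerate(p) if item > 18), [])
print(found)    # [20, 13, 29, 3, 39]

# With no match the generator is empty, so next() returns the default [].
missing = next((q[i:] for i, item in enumerate(q) if item > 18), [])
print(missing)  # []

# Without the default, an empty generator raises StopIteration instead.
try:
    next(q[i:] for i, item in enumerate(q) if item > 18)
except StopIteration:
    print("no element above 18")
```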

Which approach is fastest depends on the data structure.

The plots below compare the runtimes of the answers here for various lengths of p.

If the original data is a list, then using a lazy iterator as proposed by @Kelly Bundy is the clear winner:

[image: runtime plot for list input]

But if the initial data is an ndarray object, then the vectorized operations proposed by @richardec and @0x263A are faster. In particular, numpy beats the list methods regardless of array size. For very large arrays, however, pandas starts to perform better than numpy. I don't know why, and I would appreciate it if anyone could explain it.

[image: runtime plot for ndarray input]

Code used to generate the first plot:

import perfplot
import numpy as np
import pandas as pd
import random
from itertools import dropwhile

def it_dropwhile(p):
    return list(dropwhile(lambda x: x <= 18, p))

def walrus(p):
    exceeded = False
    return [x for x in p if (exceeded := exceeded or x > 18)]

def explicit_loop(p):
    for i, x in enumerate(p):
        if x > 18:
            output = p[i:]
            break
    else:
        output = []
    return output

def genexpr_next(p):
    return next((p[i:] for i, item in enumerate(p) if item > 18), [])

def np_argmax(p):
    return p[(np.array(p) > 18).argmax():]

def pd_idxmax(p):
    s = pd.Series(p)
    return s[s.gt(18).idxmax():]

def list_index(p):
    for x in p:
        if x > 18:
            return p[p.index(x):]
    return []

def lazy_iter(p):
    it = iter(p)
    for x in it:
        if x > 18:
            return [x, *it]
    return []

perfplot.show(
    setup=lambda n: random.choices(range(0, 15), k=10*n) + random.choices(range(-20,30), k=10*n),
    kernels=[it_dropwhile, walrus, explicit_loop, genexpr_next, np_argmax, pd_idxmax, list_index, lazy_iter],
    labels=['it_dropwhile','walrus','explicit_loop','genexpr_next','np_argmax','pd_idxmax', 'list_index', 'lazy_iter'],
    n_range=[2 ** k for k in range(18)],
    equality_check=np.allclose,
    xlabel='~n/20'
)

Code used to generate the second plot (note that I had to modify list_index because numpy arrays don't have an index method):

def list_index(p):
    for x in p:
        if x > 18:
            return p[np.where(p==x)[0][0]:]
    return []

perfplot.show(
    setup=lambda n: np.hstack([np.random.randint(0,15,10*n), np.random.randint(-20,30,10*n)]),
    kernels=[it_dropwhile, walrus, explicit_loop, genexpr_next, np_argmax, pd_idxmax, list_index, lazy_iter],
    labels=['it_dropwhile','walrus','explicit_loop','genexpr_next','np_argmax','pd_idxmax', 'list_index', 'lazy_iter'],
    n_range=[2 ** k for k in range(18)],
    equality_check=np.allclose,
    xlabel='~n/20'
)
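For readers who want to try the lazy_iter kernel from the benchmark code on its own, here is a stand-alone sketch. The function name and the threshold parameter are my additions. My guess is that its speed on lists comes from handing the rest of the iterator to list unpacking in one step, but I have not verified that.

```python
def lazy_iter_sketch(p, threshold=18):
    """Return the elements of p from the first value > threshold onward."""
    it = iter(p)
    for x in it:
        if x > threshold:
            # x was already consumed from `it`, so put it back in front;
            # *it then drains whatever the iterator has left.
            return [x, *it]
    return []


print(lazy_iter_sketch([4, 9, 10, 4, 20, 13, 29, 3, 39]))  # [20, 13, 29, 3, 39]

# Because it never indexes or slices, it also accepts one-shot iterables.
print(lazy_iter_sketch(n for n in range(17, 22)))           # [19, 20, 21]
```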



ANSWER 3

Score 8


Great solutions here; just wanted to demonstrate how to do it with numpy:

>>> import numpy as np
>>> p[(np.array(p) > 18).argmax():]
[20, 13, 29, 3, 39]
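One caveat with the argmax one-liner: when no element exceeds 18, the boolean mask is all False and argmax returns 0, so the whole input comes back instead of an empty result. A minimal guarded sketch is below; the function name and the threshold parameter are hypothetical.

```python
import numpy as np


def from_first_above(p, threshold=18):
    """Return the part of p starting at the first value > threshold."""
    arr = np.asarray(p)
    mask = arr > threshold
    if not mask.any():
        # argmax of an all-False mask is 0, which would keep everything,
        # so return an empty slice explicitly.
        return arr[:0]
    # For a boolean array, argmax gives the index of the first True.
    return arr[mask.argmax():]


print(from_first_above([4, 9, 10, 4, 20, 13, 29, 3, 39]).tolist())  # [20, 13, 29, 3, 39]
print(from_first_above([1, 2, 3]).tolist())                         # []
```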

Since there are a lot of nice answers here, I decided to run some simple benchmarks. The first one uses the OP's sample array ([4,9,10,4,20,13,29,3,39]) of length 9. The second uses a randomly generated array of length 20 thousand, where the first half is between 0 and 15, and the second half is between -20 and 30 (so that the split wouldn't occur right in the center).

Using the OP's data (array of length 9):

%timeit enke()
650 ns ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit j1lee1()
546 ns ± 4.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit j1lee2()
551 ns ± 19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit j1lee3()
536 ns ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit richardec()
2.08 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using an array of length 20,000:

%timeit enke()
1.5 ms ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit j1lee1()
1.95 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit j1lee2()
2.1 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit j1lee3()
2.33 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit richardec()
13.3 µs ± 461 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Code to generate the second array:

p = np.hstack([np.random.randint(0,15,10000),np.random.randint(-20,30,10000)])

So, for the small case, numpy is the slowest option and not needed. But for the large case, numpy is more than 100 times faster and the way to go! :)
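The ratios behind that conclusion can be checked from the timings quoted above. This is just arithmetic on the reported means, comparing numpy with the fastest non-numpy result in each case; the variable names are mine.

```python
# Mean timings copied from the %timeit output above, in nanoseconds.
fastest_list_small_ns = 536        # best list-based result, length 9
numpy_small_ns = 2_080             # richardec(), length 9
fastest_list_large_ns = 1_500_000  # best list-based result, length 20,000
numpy_large_ns = 13_300            # richardec(), length 20,000

small_ratio = numpy_small_ns / fastest_list_small_ns
large_ratio = fastest_list_large_ns / numpy_large_ns

print(f"small input: numpy is {small_ratio:.1f}x slower")  # 3.9x slower
print(f"large input: numpy is {large_ratio:.1f}x faster")  # 112.8x faster
```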




ANSWER 4

Score 7


I noticed the OP mentioned under an answer that p is actually a Pandas DataFrame. Here is a method of filtering out all elements before the first instance of a number greater than 18 using Pandas:

import pandas as pd
df = pd.DataFrame([4,9,10,4,20,13,29,3,39])
df = df[df[0].gt(18).idxmax():]
print(df)

Outputs:

    0
4  20
5  13
6  29
7   3
8  39

Note: I'm blind to the actual structure of your DataFrame so I just used exactly what was given.
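Two things to watch with the idxmax approach: it returns an index label rather than a row position, so I would only rely on the slice when the frame has the default 0..n-1 index, and it returns the first label when nothing exceeds 18. The sketch below is one possible position-based variant; the function name and parameters are hypothetical, and the non-default index is made up for the demonstration.

```python
import pandas as pd


def rows_from_first_above(df, column, threshold=18):
    """Return the rows of df from the first value > threshold in column."""
    mask = df[column].gt(threshold)
    if not mask.any():
        # No match: return an empty frame with the same columns.
        return df.iloc[0:0]
    # argmax on the underlying array gives a position, not a label.
    pos = int(mask.to_numpy().argmax())
    return df.iloc[pos:]


# Hypothetical frame whose index does not start at 0.
df = pd.DataFrame({0: [4, 9, 10, 4, 20, 13, 29, 3, 39]}, index=range(10, 19))
print(rows_from_first_above(df, 0)[0].tolist())  # [20, 13, 29, 3, 39]
print(len(rows_from_first_above(pd.DataFrame({0: [1, 2, 3]}), 0)))  # 0
```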