The Python Oracle

How to conditionally copy a substring into a new column of a pandas dataframe?

Full question
https://stackoverflow.com/questions/4585...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #string #pandas #dataframe #substring


ACCEPTED ANSWER

Score 3


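You can use pd.Series.str.extract on only the rows where A is VALID. Assigning the result back to the dataframe aligns on the index, so the rows that were filtered out get NaN: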

In [737]: df
Out[737]: 
         A                                   B
0    VALID       asdfafX'XextractthisY'Yeaaadf
1  INVALID         secondrowX'XsubtextY'Yelakj
2    VALID  secondrowX'XextractthistooY'Yelakj

In [745]: df['C'] = df[df.A == 'VALID'].B.str.extract("(?<=X'X)(.*?)(?=Y'Y)", expand=False)

In [746]: df
Out[746]: 
         A                                   B               C
0    VALID       asdfafX'XextractthisY'Yeaaadf     extractthis
1  INVALID         secondrowX'XsubtextY'Yelakj             NaN
2    VALID  secondrowX'XextractthistooY'Yelakj  extractthistoo
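
The session above starts from an existing df, so here is a minimal self-contained sketch of the same idea. The dataframe is rebuilt from the values shown in the output, and the names pattern and valid are only illustrative; df.loc with a boolean mask is used in place of df[df.A == 'VALID'].B, which should behave the same way here.

```python
import pandas as pd

# Rebuild the example dataframe from the output shown above.
df = pd.DataFrame({
    "A": ["VALID", "INVALID", "VALID"],
    "B": [
        "asdfafX'XextractthisY'Yeaaadf",
        "secondrowX'XsubtextY'Yelakj",
        "secondrowX'XextractthistooY'Yelakj",
    ],
})

# Same pattern as in the answer: capture the text between X'X and Y'Y.
pattern = r"(?<=X'X)(.*?)(?=Y'Y)"

# Boolean mask selecting the rows to extract from.
valid = df["A"] == "VALID"

# Extract only from the VALID rows. The result keeps the original index
# labels (0 and 2), so the assignment aligns on the index and row 1 is
# left as NaN.
df["C"] = df.loc[valid, "B"].str.extract(pattern, expand=False)

print(df)
```

With expand=False and a single capture group, str.extract returns a Series rather than a one-column dataframe, which is what makes the direct column assignment work.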

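The regex pattern is:

(?<=X'X)(.*?)(?=Y'Y)
  • (?<=X'X) is a lookbehind: it requires X'X immediately before the match without making it part of the match

  • (.*?) is a non-greedy capture group: it matches as few characters as possible between the lookbehind and the lookahead

  • (?=Y'Y) is a lookahead: it requires Y'Y immediately after the match without making it part of the match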
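
To see what each part of the pattern contributes, the following sketch uses Python's standard re module on single strings. The second test string and the variable names are made up for illustration. It also shows that, because str.extract returns the capture group, a plain X'X(.*?)Y'Y pattern without lookarounds would likely capture the same text.

```python
import re

text = "asdfafX'XextractthisY'Yeaaadf"

# Lookarounds: the delimiters are required but not part of the match,
# so the whole match and the capture group are the same text.
lookaround = re.search(r"(?<=X'X)(.*?)(?=Y'Y)", text)
print(lookaround.group(0), lookaround.group(1))

# Plain delimiters: the whole match includes X'X and Y'Y, but the
# capture group still holds only the text in between.
plain = re.search(r"X'X(.*?)Y'Y", text)
print(plain.group(0), plain.group(1))

# Non-greedy vs greedy on a string with two delimited sections
# (a hypothetical input, not from the original data).
two = "aX'XoneY'Y bX'XtwoY'Y"
lazy = re.search(r"(?<=X'X)(.*?)(?=Y'Y)", two)
greedy = re.search(r"(?<=X'X)(.*)(?=Y'Y)", two)
print(lazy.group(1))    # stops at the first Y'Y
print(greedy.group(1))  # runs on to the last Y'Y
```

The non-greedy quantifier is the part that matters when a value in column B could contain more than one Y'Y marker.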