What rules does Pandas use to generate a view vs a copy?
Full question
https://stackoverflow.com/questions/2329...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #dataframe #indexing #chainedassignment
ACCEPTED ANSWER
Score 200
Here are the rules; later rules override earlier ones:
All operations generate a copy.
If inplace=True is provided, the operation modifies the object in place; only some operations support this.
An indexer that sets, e.g. .loc/.iloc/.iat/.at, will set in place.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy, as it's evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
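The first three rules can be illustrated with a minimal sketch. The frame and column names below are made up for illustration, and the example deliberately avoids the view-versus-copy cases for getting, since those depend on memory layout and pandas version.

```python
import pandas as pd

# Hypothetical data, chosen only to make the effects easy to see.
df = pd.DataFrame({"A": [3, 1, 2], "B": [30, 10, 20]})

# Default: the operation returns a new object and leaves df untouched.
sorted_df = df.sort_values("A")
original_order = df["A"].tolist()

# inplace=True: df itself is modified and the call returns None.
result = df.sort_values("A", inplace=True)
inplace_order = df["A"].tolist()

# A setting indexer writes directly into df.
df.loc[df["A"] > 1, "B"] = 0
b_after_set = df["B"].tolist()
```

Here sorted_df is a separate object, while the inplace=True call and the .loc assignment both change df itself.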
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work.
Chained indexing is 2 separate Python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
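The difference matters most when setting values. The sketch below uses made-up data with columns B through E, and a helper, make_df, that is purely illustrative. It assigns through the chained form and through a single .loc call. Warnings are silenced only so the comparison runs quietly; which warning appears, if any, depends on the pandas version.

```python
import warnings

import pandas as pd


def make_df():
    # Hypothetical frame; only the column names B..E mirror the question.
    return pd.DataFrame(
        {"B": [5, 1, 7], "C": [2, 3, 7], "D": [1, 1, 1], "E": [9, 9, 9]}
    )


# Chained form: df[mask] builds a temporary object, then .loc sets on it.
chained = make_df()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about chained assignment
    chained[chained.C <= chained.B].loc[:, "B":"E"] = 0

# Single .loc call: one operation applied to the frame itself.
direct = make_df()
direct.loc[direct.C <= direct.B, "B":"E"] = 0
```

With a boolean mask, the chained assignment lands on a temporary copy, so chained stays unchanged while direct has the selected rows zeroed.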
ANSWER 2
Score 6
Since pandas 1.5.0, pandas has a Copy-on-Write (CoW) mode that makes any DataFrame/Series derived from another behave like a copy, even if it is backed by a view. When it is enabled, a copy is made only when an object that shares data with another DataFrame/Series is modified. With CoW disabled, operations like slicing create a view (and unexpectedly change the original if the new DataFrame is modified), but with CoW, modifying the result creates a copy first.
import pandas as pd

pd.options.mode.copy_on_write = False   # disable CoW (this is the default as of pandas 2.0)
df = pd.DataFrame({'A': range(4), 'B': list('abcd')})
df1 = df.iloc[:4]                       # view
df1.iloc[0] = 100
df.equals(df1)                          # True <--- df changes together with df1
pd.options.mode.copy_on_write = True    # enable CoW (this is planned to be the default by pandas 3.0)
df = pd.DataFrame({'A': range(4), 'B': list('abcd')})
df1 = df.iloc[:4]                       # copy because data is shared
df1.iloc[0] = 100
df.equals(df1)                          # False <--- df doesn't change when df1 changes
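If code has to behave the same whether CoW is on or off, one option is to request the copy explicitly. This is a suggestion added here under that assumption, not something the answer itself prescribes. A minimal sketch using the same frame as above:

```python
import pandas as pd

df = pd.DataFrame({"A": range(4), "B": list("abcd")})

# An explicit copy does not share data with df, whichever mode is active.
df1 = df.iloc[:4].copy()
df1.iloc[0, 0] = 100  # write a single cell in column A

unchanged = df["A"].tolist()
changed = df1["A"].tolist()
```

The trade-off is that the copy is always made, even where CoW would have deferred or skipped it.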
One consequence is that pandas operations can be faster with CoW. In the following example, in the first case (CoW disabled), every intermediate step creates a copy, while in the second case (CoW enabled), a copy is created only at assignment (all intermediate steps operate on views). The runtime difference comes from that: in the second case, data is not copied unnecessarily.
df = pd.DataFrame({'A': range(1_000_000), 'B': range(1_000_000)})
%%timeit
with pd.option_context('mode.copy_on_write', False):  # disable CoW in a context manager
    df1 = df.add_prefix('col ').set_index('col A').rename_axis('index col').reset_index()
# 30.5 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
with pd.option_context('mode.copy_on_write', True):   # enable CoW in a context manager
    df2 = df.add_prefix('col ').set_index('col A').rename_axis('index col').reset_index()
# 18 ms ± 513 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)