The Python Oracle

How to iterate over rows in a DataFrame in Pandas

--


Full question
https://stackoverflow.com/questions/1647...

Question links:
[similar question]: https://stackoverflow.com/questions/7837...

Accepted answer links:
[DataFrame.iterrows]: https://pandas.pydata.org/pandas-docs/st...

Answer 2 links:
[DataFrame.to_string()]: https://pandas.pydata.org/pandas-docs/st...
[here]: https://stackoverflow.com/questions/2487...
[Cython]: https://en.wikipedia.org/wiki/Cython
[DataFrame.apply()]: https://pandas.pydata.org/pandas-docs/st...
[DataFrame.itertuples()]: https://pandas.pydata.org/pandas-docs/st...
[iteritems()]: https://pandas.pydata.org/pandas-docs/st...
[DataFrame.iterrows()]: https://pandas.pydata.org/pandas-docs/st...
[The documentation page]: https://pandas.pydata.org/pandas-docs/st...
[Vectorization]: https://stackoverflow.com/questions/1422...
[Cython]: https://cython.org
[Essential Basic Functionality]: https://pandas.pydata.org/pandas-docs/st...
[Cython extensions]: https://pandas.pydata.org/pandas-docs/st...
[List Comprehensions]: https://docs.python.org/3/tutorial/datas...
[good amount of evidence]: https://stackoverflow.com/questions/5402...
[Benchmarking code, for your reference]: https://gist.github.com/Coldsp33d/948f96...
[this post of mine]: https://stackoverflow.com/questions/5443...
[10 Minutes to pandas]: https://pandas.pydata.org/pandas-docs/st...
[Essential Basic Functionality]: https://pandas.pydata.org/pandas-docs/st...
[Enhancing Performance]: https://pandas.pydata.org/pandas-docs/st...
[Are for-loops in pandas really bad? When should I care?]: https://stackoverflow.com/questions/5402...
[When should I (not) want to use pandas apply() in my code?]: https://stackoverflow.com/questions/5443...

Answer 3 links:
[this answer]: https://stackoverflow.com/a/55557758/384...
[DataFrame.iterrows()]: http://pandas.pydata.org/pandas-docs/sta...
[DataFrame.itertuples()]: http://pandas.pydata.org/pandas-docs/sta...
[DataFrame.apply()]: http://pandas.pydata.org/pandas-docs/sta...
[pandas docs on iteration]: https://pandas.pydata.org/docs/user_guid...

Answer 4 links:
[df.iterrows()]: http://pandas.pydata.org/pandas-docs/sta...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe



ACCEPTED ANSWER

Score 5449


DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index()  # reset to a default 0..n-1 index so index labels match row positions

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:

10 100
11 110
12 120

Obligatory disclaimer from the documentation

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

  • Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …
  • When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.
  • If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.

Other answers in this thread go into greater depth on alternatives to the iter* functions, if you are interested in learning more.
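
As a rough illustration of the first two alternatives from the disclaimer, the sketch below computes the same per-row sum three ways: with an iterrows() loop, with a vectorized column operation, and with apply(). The column names reuse the c1/c2 example above; the sum itself is only an assumed stand-in for whatever per-row work you actually need.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Row-by-row loop: one Series is built per row
loop_sums = [int(row['c1'] + row['c2']) for _, row in df.iterrows()]

# Vectorized: operate on whole columns at once, no Python-level loop
vectorized_sums = df['c1'] + df['c2']

# apply(): call a function once per row (axis=1)
applied_sums = df.apply(lambda row: row['c1'] + row['c2'], axis=1)

print(loop_sums)                 # [110, 121, 132]
print(vectorized_sums.tolist())  # [110, 121, 132]
print(applied_sums.tolist())     # [110, 121, 132]
```

All three give the same values here; the vectorized form is usually the one to try first when the operation can be written in terms of whole columns.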




ANSWER 2

Score 578


First consider if you really need to iterate over rows in a DataFrame. See cs95's answer for alternatives.

If you still need to iterate over rows, you can use the methods below. Note some important caveats that are not mentioned in any of the other answers.

itertuples() is supposed to be faster than iterrows().

But be aware that, according to the docs (pandas 0.24.2 at the time of writing):

  • iterrows: dtype might not match from row to row

    Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and which is generally much faster than iterrows().

  • iterrows: Do not modify rows

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

    Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2, axis=1)
    
  • itertuples:

    The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.

See pandas docs on iteration for more details.
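
The caveats above can be checked with a small sketch. The DataFrames and column names here (n, x, 'my col', '_hidden') are made up for illustration, and the behaviour shown is what current pandas releases are expected to do rather than a guarantee for every version.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [1.5, 2.5]})

# iterrows: each row becomes a single Series, so the integer column
# is upcast to float to share one dtype with the float column
_, first_row = next(df.iterrows())
print(first_row['n'])    # 1.0

# itertuples: each value keeps the type of its own column
first_tuple = next(df.itertuples())
print(first_tuple.n)     # 1

# Writing to the row yielded by iterrows does not change this DataFrame,
# because the row here is a copy
for _, row in df.iterrows():
    row['n'] = 99
print(df['n'].tolist())  # [1, 2]

# itertuples: names that are not valid identifiers, or that start with
# an underscore, are replaced by positional names
df2 = pd.DataFrame({'my col': [1], '_hidden': [2]})
fields = next(df2.itertuples())._fields
print(fields)            # ('Index', '_1', '_2')
```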




ANSWER 3

Score 255


You should use df.iterrows(), although iterating row by row is not especially efficient because a Series object has to be created for each row.




ANSWER 4

Score 189


While iterrows() is a good option, sometimes itertuples() can be much faster:

import pandas as pd
from numpy.random import randn, randint

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})

%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

# row[0] is the index, so row[1] is the value of column 'a'
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop
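
The %timeit lines above only run inside IPython or Jupyter. A minimal sketch of a similar comparison with the standard-library timeit module follows; the helper names with_iterrows and with_itertuples are hypothetical, the data is a reduced version of the DataFrame above, and the timings will vary by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(1000), 'x': 'x'})

def with_iterrows():
    # One Series is constructed per row
    return [row['a'] * 2 for _, row in df.iterrows()]

def with_itertuples():
    # One namedtuple per row; columns are accessed by attribute name
    return [row.a * 2 for row in df.itertuples(index=False)]

# Both approaches should produce the same values
same_result = with_iterrows() == with_itertuples()
print(same_result)

# Total seconds for 5 runs of each function
print(timeit.timeit(with_iterrows, number=5))
print(timeit.timeit(with_itertuples, number=5))
```

Accessing row.a by name is usually easier to read than row[1], as long as the column names are valid Python identifiers.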