series.unique vs list of set - performance
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Life in a Drop
--
Chapters
00:00 Series.Unique Vs List Of Set - Performance
00:40 Accepted Answer Score 19
01:15 Answer 2 Score 8
02:14 Answer 3 Score 4
03:02 Thank you
--
Full question
https://stackoverflow.com/questions/4683...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #unique
#avk47
ACCEPTED ANSWER
Score 19
It will depend on the data type. For numeric types, pd.unique should be significantly faster.
For strings, which are stored as Python objects, the difference will be much smaller, and set() will usually be competitive, since it is doing a very similar thing.
Some examples:
import numpy as np
import pandas as pd

strs = np.repeat(np.array(['a', 'b', 'c'], dtype='O'), 10000)
In [11]: %timeit pd.unique(strs)
558 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [12]: %timeit list(set(strs))
531 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
nums = np.repeat(np.array([1, 2, 3]), 10000)
In [13]: %timeit pd.unique(nums)
230 µs ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [14]: %timeit list(set(nums))
2.16 ms ± 71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
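Beyond the timings, it is worth noting that the two approaches differ in ordering: pd.unique preserves order of first appearance, while set() is unordered. A minimal check, reusing the string setup from the examples above:

```python
import numpy as np
import pandas as pd

# Same object-dtype string array as in the timing example above
strs = np.repeat(np.array(['a', 'b', 'c'], dtype='O'), 10000)

# pd.unique returns values in order of first appearance; set() is unordered
u_pd = pd.unique(strs)
u_set = set(strs)

print(list(u_pd))     # order of first appearance
print(sorted(u_set))  # sets must be sorted to get a stable order
```

Both contain the same values; only the guaranteed ordering differs.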
ANSWER 2
Score 8
It makes sense to use a categorical dtype for columns that have only a few unique values.
Demo:
df = pd.DataFrame(np.random.choice(['aa','bbbb','c','ddddd','EeeeE','xxx'], 10**6), columns=['Day'])
In [34]: %timeit list(set(df['Day']))
98.1 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [35]: %timeit df['Day'].unique()
82.9 ms ± 56.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Almost the same timing for 1M rows.
Let's test category dtype:
In [37]: df['cat'] = df['Day'].astype('category')
In [38]: %timeit list(set(df['cat']))
93.7 ms ± 766 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [39]: %timeit df['cat'].unique()
25.1 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
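The speedup comes from how categoricals are stored: the distinct values are kept once as categories and each row becomes a small integer code, so .unique() only has to look at the codes. A short sketch of the conversion used above (note that .unique() on a categorical column returns a Categorical rather than a plain ndarray):

```python
import numpy as np
import pandas as pd

# Same setup as the demo above: 1M rows drawn from 6 distinct strings
df = pd.DataFrame(
    np.random.choice(['aa', 'bbbb', 'c', 'ddddd', 'EeeeE', 'xxx'], 10**6),
    columns=['Day'])

# Each row is stored as an integer code into the categories array
df['cat'] = df['Day'].astype('category')

print(df['cat'].cat.categories.tolist())  # the distinct values, stored once (sorted)
print(df['cat'].cat.codes.dtype)          # small integer codes per row
```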
UPDATE: 500 unique values in a 1,000,000-row DataFrame:
In [75]: a = pd.util.testing.rands_array(10, 500)
In [76]: df = pd.DataFrame({'Day':np.random.choice(a, 10**6)})
In [77]: df.shape
Out[77]: (1000000, 1)
In [78]: df.Day.nunique()
Out[78]: 500
In [79]: %timeit list(set(df['Day']))
55 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit df['Day'].unique()
133 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [81]: df['cat'] = df['Day'].astype('category')
In [82]: %timeit list(set(df['cat']))
102 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [83]: %timeit df['cat'].unique()
38.3 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion: it's always better to timeit on your real data - you might get different results...
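The advice above can be turned into a small reusable harness using the stdlib timeit module. compare_unique is a hypothetical helper, not part of pandas; swap in your real column for the synthetic Series:

```python
import timeit

import numpy as np
import pandas as pd

def compare_unique(s, number=10):
    """Time list(set(...)) vs Series.unique() on the given Series (hypothetical helper)."""
    t_set = timeit.timeit(lambda: list(set(s)), number=number)
    t_pd = timeit.timeit(lambda: s.unique(), number=number)
    print(f"list(set()):     {t_set / number * 1e3:.2f} ms/loop")
    print(f"Series.unique(): {t_pd / number * 1e3:.2f} ms/loop")
    return t_set, t_pd

# Synthetic example; replace with the column from your real DataFrame
s = pd.Series(np.random.choice(['a', 'b', 'c'], 10**5))
compare_unique(s)
```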
ANSWER 3
Score 4
The result seems to vary strongly with the number of unique entries.
Here are some timings on the IPL dataset (deliveries).
For the column match_id with 577 unique ids, unique() is clearly more efficient:
%timeit list(set(deliveries['match_id']))
27.5 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit deliveries['match_id'].unique()
1.79 ms ± 322 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For the column batting_team with 13 unique teams, list(set()) is slightly better here:
%timeit list(set(deliveries['batting_team']))
9.92 ms ± 945 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit deliveries['batting_team'].unique()
10.2 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Edit: testing the claim about strings from the accepted answer. Running the same tests on a string column batsman with 436 unique entries:
%timeit list(set(deliveries['batsman']))
9.32 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit deliveries['batsman'].unique()
8.06 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
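These string-column timings are consistent with the accepted answer: string columns are stored as Python objects, so both set() and Series.unique() end up hashing the same Python strings and their timings converge. A quick way to confirm what a column actually holds (synthetic data standing in for the IPL dataset, which isn't reproduced here):

```python
import pandas as pd

# Tiny stand-in for a team-name column
s = pd.Series(['MI', 'CSK', 'RCB', 'MI'])

print(s.dtype)     # object dtype: each element is a Python string
print(s.unique())  # distinct values, in order of first appearance
```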