Implementing the Alon-Matias-Szegedy Algorithm For The Second Moment Stream Approximation
Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Magic Ocean Looping
--
Chapters
00:00 Question
04:13 Accepted answer (Score 3)
05:47 Thank you
--
Full question
https://stackoverflow.com/questions/3611...
Accepted answer links:
[collections.Counter]: https://pymotw.com/2/collections/counter...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #random #datamining #datastream #bigdata
#avk47
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Magic Ocean Looping
--
Chapters
00:00 Question
04:13 Accepted answer (Score 3)
05:47 Thank you
--
Full question
https://stackoverflow.com/questions/3611...
Accepted answer links:
[collections.Counter]: https://pymotw.com/2/collections/counter...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #random #datamining #datastream #bigdata
#avk47
ACCEPTED ANSWER
Score 3
This is an interesting question.
Say we start with
import random
import string
size = 100000
seq = [random.choice(string.ascii_letters) for x in range(size)]
Then the first implementation is similar to yours (note the use of collections.Counter, though):
from collections import Counter
def secondMoment(seq):
c = Counter(seq)
return sum(v**2 for v in c.values())
>>> secondMoment(seq)
192436972
The second implementation differs more significantly than yours, though. Note that first the random indices are found. Then, an element is counted only after its first occurrence (if any) at one of the indices:
from collections import defaultdict
def AMSestimate(seq, num_samples=10):
inds = list(range(len(seq)))
random.shuffle(inds)
inds = sorted(inds[: num_samples])
d = {}
for i, c in enumerate(seq):
if i in inds and c not in d:
d[c] = 0
if c in d:
d[c] += 1
return int(len(seq) / float(len(d)) * sum((2 * v - 1) for v in d.values()))
>>> AMSestimate(seq)
171020000
Edit Regarding The Original Code In The Question
In the code in the question, consider your loop
for el in vector:
if el in elements:
elements[el] += 1
elif random.choice(range(0, 10)) == 0:
elements[el] = 1
(Minor) The sampling is problematic: it is hard-coded probabilistic at 0.1
Also consider:
estimateM2 += lenvect * ((2 * value) - 1)
This lacks a division by the number of sampled elements.