The Python Oracle

Implementing the Alon-Matias-Szegedy Algorithm For The Second Moment Stream Approximation

Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn

--

Music by Eric Matyas
https://www.soundimage.org
Track title: Magic Ocean Looping

--

Chapters
00:00 Question
04:13 Accepted answer (Score 3)
05:47 Thank you

--

Full question
https://stackoverflow.com/questions/3611...

Accepted answer links:
[collections.Counter]: https://pymotw.com/2/collections/counter...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #random #datamining #datastream #bigdata

#avk47



ACCEPTED ANSWER

Score 3


This is an interesting question.

Say we start with

import random
import string

size = 100000
seq = [random.choice(string.ascii_letters) for x in range(size)]

Then the first implementation is similar to yours (note the use of collections.Counter, though):

from collections import Counter

def secondMoment(seq):
    c = Counter(seq)
    return sum(v**2 for v in c.values())

>>> secondMoment(seq)
192436972

The second implementation differs more significantly than yours, though. Note that first the random indices are found. Then, an element is counted only after its first occurrence (if any) at one of the indices:

from collections import defaultdict

def AMSestimate(seq, num_samples=10):
    inds = list(range(len(seq)))
    random.shuffle(inds)
    inds = sorted(inds[: num_samples])

    d = {}
    for i, c in enumerate(seq):
        if i in inds and c not in d:
            d[c] = 0
        if c in d:
            d[c] += 1
    return int(len(seq) / float(len(d)) * sum((2 * v - 1) for v in d.values()))

>>> AMSestimate(seq)
171020000

Edit Regarding The Original Code In The Question

In the code in the question, consider your loop

for el in vector:
    if el in elements:
        elements[el] += 1
    elif random.choice(range(0, 10)) == 0:
        elements[el] = 1

(Minor) The sampling is problematic: it is hard-coded probabilistic at 0.1

Also consider:

    estimateM2 += lenvect * ((2 * value) - 1)

This lacks a division by the number of sampled elements.