python counting number of presence and absence of substrings in list of sequences
--
Track title: CC B Schubert's Piano Sonata No 16 D
--
Chapters
00:00 Question
02:25 Accepted answer (Score -1)
02:43 Answer 2 (Score 5)
04:01 Answer 3 (Score 3)
05:54 Answer 4 (Score 2)
08:37 Thank you
--
Full question
https://stackoverflow.com/questions/2002...
Question links:
[2shared]: http://www.2shared.com/document/j6ecquCd...
Answer 2 links:
[CountVectorizer]: http://scikit-learn.org/stable/modules/f...
[scikits-learn]: http://scikit-learn.org/stable/index.htm...
[here]: https://gist.github.com/elyase/7509072
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #string #numpy #scipy #bioinformatics
#avk47
ANSWER 1
Score 5
Excellent question. This is a classic computer science problem. Yes, there is indeed a better algorithm. Yours processes each long string 16384 times. A better way is to process each long string only once.
Rather than searching for each motif within each long string, you should just record which motifs appear in each long string. For example, if you were searching for length 2 motifs in the following string:
s = 'ACGTAC'
then you could run a loop over the length 2 substrings and record which ones are present in a dict:
motifAppearances = {}
for i in range(len(s)-1):
    motif = s[i:i+2]                  # grab a length=2 substring
    if motif not in motifAppearances:
        motifAppearances[motif] = 0   # initialize the count
    motifAppearances[motif] += 1      # increment the count
Now you've processed the entire string exactly once and found all the motifs present in it. In this case the resulting dict would look like:
motifAppearances = {'AC':2, 'CG':1, 'GT':1, 'TA':1}
Doing something similar for your case should cut your run time down by roughly a factor of 16384, since each long string is now scanned once instead of once per motif.
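The same single pass scales directly to the real case. Here is a minimal sketch (assuming width-7 motifs and a hypothetical list sequences standing in for your long strings), using collections.Counter to do the bookkeeping:
from collections import Counter

width = 7
occurrences = Counter()   # total occurrences of each motif over all sequences
presence = Counter()      # number of sequences containing each motif at least once

sequences = ['ACGTACGTACGT', 'TTTACGTTTACGT']   # stand-in for the real data

for s in sequences:
    seen = Counter(s[i:i+width] for i in range(len(s) - width + 1))
    occurrences.update(seen)
    presence.update(seen.keys())   # each motif counted once per sequence

# any of the 16384 motifs never seen simply looks up as presence[motif] == 0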
ANSWER 2
Score 3
A clean and very fast way (~15s with OP's data) would be to use the CountVectorizer of scikits-learn as it uses numpy under the hood, for example:
from sklearn.feature_extraction.text import CountVectorizer

def make_chunks(s):
    width = 2
    return [s[i:i+width] for i in range(len(s)-width+1)]

l = ['ATTGCGGCTCACGAA', 'ACCTAGATACGACGG', 'CCCCTGTCCATGGTA']

vectorizer = CountVectorizer(tokenizer=make_chunks)
X = vectorizer.fit_transform(l)
Now X is a sparse matrix with one column per chunk observed in the data and one row per sequence, where every value is the number of occurrences of that chunk in that sequence:
>>> X.toarray()
# aa ac ag at ca cc cg ...
[[1 1 0 1 1 0 2 1 1 2 1 0 0 1 1 1] # ATTGCGGCTCACGAA
[0 3 1 1 0 1 2 1 2 0 1 0 2 0 0 0] # ACCTAGATACGACGG
[0 0 0 1 1 4 0 1 0 0 1 2 1 1 2 0]] # CCCCTGTCCATGGTA
>>> (X.toarray()>0).astype(int) # the same but counting only once per sequence
[[1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1]
[0 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0]
[0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0]]
>>> vectorizer.get_feature_names() # the columns(chunks)
[u'aa', u'ac', u'ag', u'at', u'ca', u'cc', u'cg', u'ct', u'ga', u'gc', u'gg', u'gt', u'ta', u'tc', u'tg', u'tt']
Now you can sum along the columns, mask non-values or whatever manipulation you need to do, for example:
>>> X.sum(axis=0)
[[1 4 1 3 2 5 4 3 3 2 3 2 3 2 3 1]]
Finally, to find how many occurrences a given motif has, find the index of the corresponding motif/chunk and then look it up in the previous sum:
>>> index = vectorizer.vocabulary_.get('ag') # 'ag' is your motif
2 # this means third column
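Plugging that index into the column sums computed above closes the loop (a small continuation of the same session, assuming X and index are as defined):
>>> X.sum(axis=0)[0, index]  # total occurrences of 'ag' across all sequences
1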
In your case you would have to divide your list into two parts (positive and negative values) in order to include the down condition, as sketched below. I made a quick test with the list from DSM's answer and it takes around 3-4 seconds on my computer to run. With 12,000 sequences of length 4,000 it takes a little less than a minute.
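A rough sketch of that split (not the linked code; it assumes the rows look like [name, value, sequence], reuses make_chunks from above, and uses 0 as the cutoff for the down condition):
# hypothetical rows in the OP's format: [name, value, sequence]
rows = [['a', 0.7, 'ATTGCGGCTCACGAA'],
        ['b', -0.8, 'ACCTAGATACGACGG'],
        ['c', -0.9, 'CCCCTGTCCATGGTA']]

down_seqs = [seq for name, value, seq in rows if value < 0]
rest_seqs = [seq for name, value, seq in rows if value >= 0]

vectorizer = CountVectorizer(tokenizer=make_chunks)
vectorizer.fit(down_seqs + rest_seqs)        # one shared vocabulary for both groups

X_down = vectorizer.transform(down_seqs)
X_rest = vectorizer.transform(rest_seqs)

# presence counts per chunk in each group, counting at most once per sequence
present_down = (X_down.toarray() > 0).sum(axis=0)
present_rest = (X_rest.toarray() > 0).sum(axis=0)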
EDIT: The whole code using OP's data can be found here.
ANSWER 3
Score 2
There are several odd things about your code.
- What you're calling "permutations" looks more like the Cartesian product, which can be computed using itertools.product.
- Because Python is zero-indexed, the first element of a string is at index 0, so a comparison like i[2].find(sMotif) < 1 will return True if the string is there right at the start, which seems a little odd.
- Your OddsRatio, PValue and Enrichment calculations are inside the loop, but neither the zeroing of the counts nor the print is, which means you're computing them cumulatively for each new row but not doing anything with that information.
- You recompute i[2].find(sMotif) multiple times in the typical case. That result isn't cached.
Assuming that I understand the numbers you're trying to compute -- and I could well be wrong, because there are several things you're doing that I don't understand -- I'd flip the logic. Instead of looping over each motif and trying to count it in each row, loop over each row and see what's there. That will be roughly 7 * the number of rows instead of the number of motifs * the number of rows.
For example:
import random
from itertools import product
from collections import defaultdict, Counter

N = 12000
datalength = 400
listoflists = [[str(i), random.uniform(-1, 1),
                ''.join([random.choice('AGCT') for c in range(datalength)])]
               for i in range(N)]

def chunk(seq, width):
    for i in range(len(seq)-width+1):
        yield seq[i:i+width]

def count_motifs(datatriples, width=7):
    motif_counts_by_down = defaultdict(Counter)
    nonmotif_counts_by_down = defaultdict(Counter)
    all_motifs = set(''.join(p) for p in product('AGCT', repeat=width))
    for symbol, value, sdata in datatriples:
        down = value < -0.5
        # what did we see?
        motifs_seen = set(chunk(sdata, width))
        # what didn't we see?
        motifs_not_seen = all_motifs - motifs_seen
        # accumulate these
        motif_counts_by_down[down].update(motifs_seen)
        nonmotif_counts_by_down[down].update(motifs_not_seen)
    return motif_counts_by_down, nonmotif_counts_by_down
(I lowered the line length just to make this run faster; if the lines are 10 times longer, the code takes about 10 times as long.)
This produces on my slow laptop (after inserting some linebreaks):
>>> %time mot, nomot = count_motifs(listoflists, 7)
CPU times: user 1min 50s, sys: 60 ms, total: 1min 50s
Wall time: 1min 50s
So I'd figure about 20 minutes for the full problem, which isn't bad for so little code. (We could speed up the motifs_not_seen part by doing the arithmetic instead, but that'd only get us a factor of two anyway.)
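For what that arithmetic could look like: a sketch (assuming the absence count for a motif is just the group size minus its presence count), reusing chunk, product, defaultdict and Counter from the code above:
def count_motifs_arith(datatriples, width=7):
    motif_counts_by_down = defaultdict(Counter)
    rows_per_group = Counter()
    for symbol, value, sdata in datatriples:
        down = value < -0.5
        rows_per_group[down] += 1
        motif_counts_by_down[down].update(set(chunk(sdata, width)))
    # absence counts fall out by subtraction; no per-row set difference needed
    all_motifs = [''.join(p) for p in product('AGCT', repeat=width)]
    nonmotif_counts_by_down = {
        down: Counter({m: rows_per_group[down] - motif_counts_by_down[down][m]
                       for m in all_motifs})
        for down in rows_per_group}
    return motif_counts_by_down, nonmotif_counts_by_down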
On a much smaller case, where it's easier to see the output:
>>> mot, nomot = count_motifs(listoflists, 2)
>>> mot
defaultdict(<class 'collections.Counter'>,
{False: Counter({'CG': 61, 'TC': 58, 'AT': 55, 'GT': 54, 'CA': 53, 'GA': 53, 'AC': 52, 'CT': 51, 'CC': 50, 'AG': 49, 'TA': 48, 'GC': 47, 'GG': 45, 'TG': 45, 'AA': 43, 'TT': 40}),
True: Counter({'CT': 27, 'GT': 26, 'TC': 24, 'GC': 23, 'TA': 23, 'AC': 22, 'AG': 21, 'TG': 21, 'CC': 19, 'CG': 19, 'CA': 19, 'GG': 18, 'TT': 17, 'GA': 17, 'AA': 16, 'AT': 16})})
>>> nomot
defaultdict(<class 'collections.Counter'>,
{False: Counter({'TT': 31, 'AA': 28, 'GG': 26, 'TG': 26, 'GC': 24, 'TA': 23, 'AG': 22, 'CC': 21, 'CT': 20, 'AC': 19, 'GA': 18, 'CA': 18, 'GT': 17, 'AT': 16, 'TC': 13, 'CG': 10}),
True: Counter({'AA': 13, 'AT': 13, 'GA': 12, 'TT': 12, 'GG': 11, 'CC': 10, 'CA': 10, 'CG': 10, 'AG': 8, 'TG': 8, 'AC': 7, 'GC': 6, 'TA': 6, 'TC': 5, 'GT': 3, 'CT': 2})})
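For a single motif, the four numbers that the OP's odds-ratio style calculations presumably need can be read straight out of these counters, for example:
motif = 'CG'   # example motif from the output above
present_down, present_rest = mot[True][motif], mot[False][motif]      # 19, 61
absent_down, absent_rest = nomot[True][motif], nomot[False][motif]    # 10, 10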
ANSWER 4
Score 0
Basically your problem is sequence comparison. Finding motifs in sequences is a fundamental question in bioinformatics, so you could look for existing algorithms or packages. I searched Google for the keywords "motif match" and this is what came up on the first page:
http://biowhat.ucsd.edu/homer/motif/
http://meme.nbcr.net/meme/doc/mast.html
http://www.biomedcentral.com/1471-2105/8/189
http://www.benoslab.pitt.edu/stamp/