Counting occurrences in a sequence with itertools.groupby

itertools.groupby is a handy tool for counting the number of occurrences of each item in a sequence. It only groups consecutive equal items, so the sequence has to be sorted first.
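
To see why the sorting step matters, here is a minimal sketch written for Python 3. The sample list and the names unsorted_runs and sorted_runs are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample data, chosen so that equal items are not adjacent.
data = ["b", "a", "b", "a", "a"]

# Without sorting, each run of consecutive equal items is its own group,
# so "a" and "b" both show up more than once.
unsorted_runs = [(key, len(list(group))) for key, group in groupby(data)]

# After sorting, equal items are adjacent, so each key appears exactly once.
sorted_runs = [(key, len(list(group))) for key, group in groupby(sorted(data))]

print(unsorted_runs)  # [('b', 1), ('a', 1), ('b', 1), ('a', 2)]
print(sorted_runs)    # [('a', 3), ('b', 2)]
```

Building a dict from the unsorted pairs would silently keep only the last run for each key, which is why every example here sorts before grouping.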

Here are some examples from the interactive interpreter.

A list of numbers

>>> # Create a random list of numbers
>>> from random import random
>>> numbers = [int(random() * 10) for x in range(20)]
>>> numbers
[8, 0, 3, 2, 3, 9, 8, 2, 8, 3, 0, 2, 3, 8, 6, 5, 3, 6, 1, 8]
>>> # Now build a dictionary mapping each number to its
>>> # number of occurrences. Feed a generator expression
>>> # of (number, frequency) pairs to dict().
>>> from itertools import groupby
>>> valdict = dict((k, len(list(g)))
           for k, g in groupby(sorted(numbers)))
>>> for key, val in sorted(valdict.items()):
    print("%s : %s" % (key, val))

0 : 2
1 : 1
2 : 3
3 : 5
5 : 1
6 : 2
8 : 5
9 : 1
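
The len(list(g)) idiom builds a temporary list for every group just to measure it. As a rough alternative, assuming Python 3, each group can be counted by walking it once with sum(). The list below repeats the sample values from the session above, and the name counts is only illustrative.

```python
from itertools import groupby

# Same values as the sample list in the session above.
numbers = [8, 0, 3, 2, 3, 9, 8, 2, 8, 3, 0, 2, 3, 8, 6, 5, 3, 6, 1, 8]

# Each group is a one-shot iterator; sum(1 for _ in group) walks it once
# and counts the items without storing them in a list.
counts = {key: sum(1 for _ in group) for key, group in groupby(sorted(numbers))}

for key, val in counts.items():
    print("%s : %s" % (key, val))
```

For short lists the difference is negligible; it mainly matters when single groups are very large.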

And a function that does this for any iterable:

from itertools import groupby

def count_occurrences(iterable):
    """Return a dictionary with items and numbers of occurrences
    in iterable.
    """

    return dict((item, len(list(group)))
        for item, group
        in groupby(sorted(iterable)))
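
A quick usage sketch, assuming Python 3. The function is repeated so the snippet runs on its own, and the inputs are made-up examples. It also shows one limitation of this approach: because of sorted(), the items must be comparable with each other.

```python
from itertools import groupby

def count_occurrences(iterable):
    """Return a dictionary mapping each item to its number of occurrences."""
    return dict((item, len(list(group)))
                for item, group in groupby(sorted(iterable)))

# Any iterable of mutually comparable items works, including a string.
letter_counts = count_occurrences("mississippi")
print(letter_counts)  # {'i': 4, 'm': 1, 'p': 2, 's': 4}

# sorted() must be able to compare the items with each other; in Python 3,
# mixing ints and strings raises TypeError.
try:
    count_occurrences([1, "a", 1])
except TypeError as exc:
    print("cannot sort mixed types:", exc)
```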

Top 20 most frequent words in a file

>>> # get a wordlist from the Python README
>>> text = open("/python25/readme.txt").read()
>>> words = text.lower().split()
>>> words[:5]
['this', 'is', 'python', 'version', '2.5.2']
>>> # get the frequency list; put the count first (DSU,
>>> # decorate-sort-undecorate) so sorting orders by frequency
>>> freqs = [(len(list(g)), k)
     for k, g in groupby(sorted(words))]
>>> # sort the freqs, get last 20, and reverse
>>> # to put most frequent first
>>> for a, b in reversed(sorted(freqs)[-20:]):
    print("%s %s" % (b.ljust(7), str(a).rjust(3)))

the     442
to      227
is      127
and     127
you     118
a       117
of      110
in      107
for      94
python   81
on       79
if       77
this     72
or       62
be       58
with     56
it       53
are      53
that     52
as       47
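
DSU stands for decorate-sort-undecorate: each word is "decorated" by putting its count first in a tuple, the tuples are sorted, and the decoration is removed afterwards. Here is a small sketch of those three steps, assuming Python 3 and a made-up word list in place of the README text.

```python
from itertools import groupby

# Hypothetical sample words, standing in for the README text.
words = "the cat and the dog and the bird".split()

# Decorate: put the count first so tuples sort by frequency.
freqs = [(len(list(g)), k) for k, g in groupby(sorted(words))]

# Sort, then undecorate back to (word, count), most frequent first.
top = [(word, count) for count, word in sorted(freqs, reverse=True)]
print(top[:2])  # [('the', 3), ('and', 2)]
```

Since the tuples are compared element by element, words with equal counts come out in reverse alphabetical order, which is why "is" appears before "and" in the listing above.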

Here's a function that will do this.

from itertools import groupby

def get_top_freqs(filename, num=20):
    """Get the top num words from filename as a list
    of (word, freq) tuples
    """

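    # Read the whole file, closing it when done
    with open(filename) as f:
        text = f.read()
    words = text.lower().split()

    # (count, word) pairs, so that sorting orders by frequency
    freqs = ((len(list(g)), k) for k, g in groupby(sorted(words)))

    # Highest counts first, limited to num entries
    return [(b, a) for a, b in sorted(freqs, reverse=True)[:num]]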
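
To try the function without a README at hand, here is a self-contained sketch for Python 3. It follows the same approach as the function above, but the temporary file, its contents, and the names sample_path and top_two are invented for the example.

```python
import os
import tempfile
from itertools import groupby

def get_top_freqs(filename, num=20):
    """Return the top num words in filename as (word, freq) tuples."""
    with open(filename) as f:
        words = f.read().lower().split()
    freqs = ((len(list(g)), k) for k, g in groupby(sorted(words)))
    return [(word, count) for count, word in sorted(freqs, reverse=True)[:num]]

# Write a small hypothetical sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Spam spam eggs spam eggs ham")
    sample_path = tmp.name

top_two = get_top_freqs(sample_path, num=2)
print(top_two)  # [('spam', 3), ('eggs', 2)]
os.remove(sample_path)
```

Note that split() only breaks on whitespace, so punctuation stays attached to words and "python" and "python," are counted separately.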
