## Counting occurrences in a sequence with itertools.groupby

itertools.groupby is a great tool for counting how many times each item occurs in a sequence.
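
One detail worth keeping in mind: groupby only groups consecutive equal items, which is why the examples below pass their input through sorted() first. A minimal sketch of the difference, using a made-up list named data:

```python
from itertools import groupby

data = [1, 1, 2, 1]  # made-up sample data

# Unsorted: the trailing 1 starts a new group, so key 1 shows up twice.
unsorted_runs = [(k, len(list(g))) for k, g in groupby(data)]
print(unsorted_runs)  # [(1, 2), (2, 1), (1, 1)]

# Sorted: equal items are adjacent, so each key shows up exactly once.
sorted_runs = [(k, len(list(g))) for k, g in groupby(sorted(data))]
print(sorted_runs)    # [(1, 3), (2, 1)]
```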

Here are some examples from the interactive interpreter.

### A list of numbers

    >>> # Create a random list of numbers
    >>> from random import random
    >>> numbers = [int(random() * 10) for x in range(20)]
    >>> numbers
    [8, 0, 3, 2, 3, 9, 8, 2, 8, 3, 0, 2, 3, 8, 6, 5, 3, 6, 1, 8]
    >>> # Now create a dictionary mapping each number to its
    >>> # number of occurrences. Feed a generator expression of
    >>> # (number, frequency) pairs to dict().
    >>> from itertools import groupby
    >>> valdict = dict((k, len(list(g)))
    ...                for k, g in groupby(sorted(numbers)))
    >>> for key, val in valdict.items():
    ...     print(key, ":", val)
    ...
    0 : 2
    1 : 1
    2 : 3
    3 : 5
    5 : 1
    6 : 2
    8 : 5
    9 : 1

And a function that does this for any iterable:

    from itertools import groupby

    def count_occurrences(iterable):
        """Return a dictionary with items and numbers of occurrences
        in iterable.
        """
        return dict((item, len(list(group)))
                    for item, group
                    in groupby(sorted(iterable)))
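
As a quick usage sketch, here is the function applied to a string. The sample input is made up, and the function is repeated so that the snippet runs on its own. The result is cross-checked against collections.Counter from the standard library, which produces the same counts without needing a sort:

```python
from collections import Counter
from itertools import groupby


def count_occurrences(iterable):
    """Same function as above, repeated so the example is self-contained."""
    return dict((item, len(list(group)))
                for item, group in groupby(sorted(iterable)))


letters = "mississippi"  # made-up sample input
counts = count_occurrences(letters)
print(counts)  # {'i': 4, 'm': 1, 'p': 2, 's': 4}

# Cross-check: Counter yields the same item -> count mapping.
assert counts == dict(Counter(letters))
```

Because the input is sorted, the items must be comparable with one another; a mix of strings and numbers, for example, would raise a TypeError under Python 3.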

### Top 20 most frequent words in a file

    >>> # get a wordlist from the Python README
    >>> # (assumes a file named README in the current directory)
    >>> text = open('README').read()
    >>> words = text.lower().split()
    >>> words[:5]
    ['this', 'is', 'python', 'version', '2.5.2']
    >>> # get the frequency list, using DSU
    >>> # (decorate-sort-undecorate) to sort top words
    >>> freqs = [(len(list(g)), k)
    ...          for k, g in groupby(sorted(words))]
    >>> # sort the freqs, get last 20, and reverse
    >>> # to put most frequent first
    >>> for a, b in reversed(sorted(freqs)[-20:]):
    ...     print("%s %s" % (b.ljust(7), str(a).rjust(3)))
    ...
    the     442
    to      227
    is      127
    and     127
    you     118
    a       117
    of      110
    in      107
    for      94
    python   81
    on       79
    if       77
    this     72
    or       62
    be       58
    with     56
    it       53
    are      53
    that     52
    as       47
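
The DSU mentioned in the session stands for decorate-sort-undecorate: each word is decorated with its count as a (count, word) tuple, the tuples are sorted, and the words are read back out in that order. A small sketch on a made-up sentence (the sample text and the names top and top_words are illustrative, not part of the README session):

```python
from itertools import groupby

text = "the cat and the dog and the bird"  # made-up sample text
words = text.lower().split()

# Decorate: pair each word with its count, count first so it drives the sort.
freqs = [(len(list(g)), k) for k, g in groupby(sorted(words))]

# Sort ascending, keep the last 2, then reverse so the most frequent is first.
top = list(reversed(sorted(freqs)[-2:]))

# Undecorate: flip the pairs back to (word, count).
top_words = [(word, count) for count, word in top]
print(top_words)  # [('the', 3), ('and', 2)]
```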

Here's a function that will do this.

    from itertools import groupby

    def get_top_freqs(filename, num=20):
        """Get the top num words from filename as a list
        of (word, freq) tuples
        """