March 13, 2008
Counting occurrences in a sequence with itertools.groupby
itertools.groupby is a great tool for counting how many times each item occurs in a sequence.
Here are some examples from the interactive interpreter.
A list of numbers
>>> from random import random
>>> numbers = [int(random() * 10) for x in range(20)]
>>> numbers
[8, 0, 3, 2, 3, 9, 8, 2, 8, 3, 0, 2, 3, 8, 6, 5, 3, 6, 1, 8]
>>> # Now build a dictionary mapping each number to its
>>> # number of occurrences. Feed a generator expression of
>>> # (number, frequency) pairs to dict().
>>> from itertools import groupby
>>> valdict = dict((k, len(list(g)))
...                for k, g in groupby(sorted(numbers)))
>>> for key, val in valdict.items():
...     print key, ":", val
...
0 : 2
1 : 1
2 : 3
3 : 5
5 : 1
6 : 2
8 : 5
9 : 1
And a function that does this for any iterable:
from itertools import groupby

def count_occurrences(iterable):
    """Return a dictionary mapping each item in iterable
    to its number of occurrences."""
    return dict((item, len(list(group)))
                for item, group
                in groupby(sorted(iterable)))
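One detail worth spelling out: groupby only groups consecutive equal items, which is why the examples above sort the input first. Here is a minimal sketch of the difference. The sample list and variable names are made up for illustration, and the snippet is written to run under Python 3.

```python
from itertools import groupby

sample = [3, 1, 3, 3, 1]  # made-up data for illustration

# Without sorting, each run of equal neighbours becomes its own group,
# so the same key shows up more than once.
unsorted_runs = [(k, len(list(g))) for k, g in groupby(sample)]

# Feeding those pairs to dict() keeps only the last run for each key,
# which silently undercounts.
wrong_counts = dict(unsorted_runs)

# Sorting first puts equal items next to each other: one group per value.
right_counts = dict((k, len(list(g))) for k, g in groupby(sorted(sample)))

print(unsorted_runs)
print(wrong_counts)
print(right_counts)
```

The sort also means the whole input is held in memory at once, and under Python 3 the items have to be mutually comparable.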
Top 20 most frequent words in a file
>>> text = open("/python25/readme.txt").read()
>>> words = text.lower().split()
>>> words[:5]
['this', 'is', 'python', 'version', '2.5.2']
>>> # get the frequency list, using DSU to sort top words
>>> freqs = [(len(list(g)), k)
...          for k, g in groupby(sorted(words))]
>>> # sort the freqs, get last 20, and reverse
>>> # to put most frequent first
>>> for a, b in reversed(sorted(freqs)[-20:]):
...     print "%s %s" % (b.ljust(7), str(a).rjust(3))
...
the     442
to      227
is      127
and     127
you     118
a       117
of      110
in      107
for      94
python   81
on       79
if       77
this     72
or       62
be       58
with     56
it       53
are      53
that     52
as       47
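The DSU (decorate-sort-undecorate) idiom mentioned in the comment works by putting the sort key first in each tuple, sorting the tuples, and then pulling the original items back out. Here is a small sketch with made-up counts. The `counts` dictionary is hypothetical and not taken from the readme file, and the snippet is written for Python 3.

```python
# Hypothetical word counts, for illustration only.
counts = {"python": 81, "the": 442, "to": 227}

# Decorate: pair each word with its count, count first so it drives the sort.
decorated = [(n, word) for word, n in counts.items()]

# Sort: tuples compare element by element, so this orders by count,
# falling back to the word when counts are equal.
decorated.sort()

# Undecorate: drop the counts, most frequent word first.
ranked = [word for n, word in reversed(decorated)]

print(ranked)
```

Because ties fall back to comparing the words and the result is then reversed, words with equal counts come out in reverse alphabetical order, which matches "is" appearing before "and" in the listing above.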
Here's a function that returns the most frequent words in a file.
from itertools import groupby

def get_top_freqs(filename, num=20):
    """Get the top num words from filename as a list
    of (word, freq) tuples
    """
    text = open(filename).read()
    words = text.lower().split()
    freqs = ((len(list(g)), k) for k, g in groupby(sorted(words)))
    return [(b, a) for a, b in reversed(sorted(freqs)[-num:])]
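To try the function without a real readme file, something like the following sketch should work. It repeats the function so the snippet runs on its own, writes a tiny made-up sentence to a temporary file, and asks for the top two words. The sample text is hypothetical, the snippet assumes Python 3, and it uses a `with` block to close the file, which the function above does not do.

```python
import os
import tempfile
from itertools import groupby


def get_top_freqs(filename, num=20):
    """Same logic as the function above, repeated so this sketch runs alone."""
    with open(filename) as f:  # 'with' closes the file when the block ends
        words = f.read().lower().split()
    freqs = ((len(list(g)), k) for k, g in groupby(sorted(words)))
    return [(b, a) for a, b in reversed(sorted(freqs)[-num:])]


# Write a tiny made-up text to a temporary file (hypothetical sample data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("the cat and the dog and the bird\n")

top = get_top_freqs(path, num=2)
os.remove(path)  # clean up the temporary file

print(top)
```

Note that split() breaks on whitespace only, so punctuation stays attached to words: "python." and "python" would be counted separately.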