
Extracting Most Frequent Words Out Of A Corpus With Python

Maybe this is a stupid question, but I have a problem with extracting the ten most frequent words out of a corpus with Python. This is what I've got so far. (BTW, I work with NLTK.)

Solution 1:

If you're using NLTK anyway, try the FreqDist(samples) class to first generate a frequency distribution from the given samples. Then call the most_common(n) method to get the n most common words in the sample, sorted by descending frequency. Something like:

from nltk.probability import FreqDist

fdist = FreqDist(stoplist)
top_ten = fdist.most_common(10)
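
For a fuller picture, here is a minimal end-to-end sketch; the corpus string and variable names are only illustrative, and it assumes NLTK's punkt tokenizer data has been downloaded:

import nltk
from nltk.probability import FreqDist

# nltk.download('punkt')  # uncomment on the first run to fetch the tokenizer data

corpus = "the quick brown fox jumps over the lazy dog the fox"
tokens = nltk.word_tokenize(corpus.lower())  # tokenize and normalize case

fdist = FreqDist(tokens)      # build the frequency distribution
print(fdist.most_common(10))  # ten most frequent words, descending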

Solution 2:

The pythonic way:

In [1]: from collections import Counter

In [2]: words = ['hello', 'hell', 'owl', 'hello', 'world', 'war', 'hello', 'war']

In [3]: counter_obj = Counter(words)

In [4]: counter_obj.most_common() #counter_obj.most_common(n=10)
Out[4]: [('hello', 3), ('war', 2), ('hell', 1), ('world', 1), ('owl', 1)]
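
Tying this back to the question, a short sketch that counts the words in a file (corpus.txt is just a placeholder path):

from collections import Counter

# 'corpus.txt' is a placeholder; lowercasing keeps 'Hello' and 'hello' together
with open('corpus.txt') as f:
    words = f.read().lower().split()

print(Counter(words).most_common(10))  # ten most frequent words with counts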

Solution 3:

The problem is in your usage of set.

A set contains no duplicates, so when you create a set of the lowercased words, you only have one occurrence of each word from there on.

Let's say your words are:

['banana', 'Banana', 'tomato', 'tomato', 'kiwi']

After your lambda lowercases everything, you have:

['banana', 'banana', 'tomato', 'tomato', 'kiwi']

But then you do:

set(['banana', 'banana', 'tomato', 'tomato', 'kiwi'])

which returns:

{'banana', 'tomato', 'kiwi'}

Since you base your calculations on the no_capitals set from that point on, you'll only ever count one occurrence of each word. Don't create a set, and your program will probably work just fine.
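
As a concrete sketch of that fix, using the Counter approach from Solution 2 (no_capitals is the name from the question, kept as a list so the duplicates survive for counting):

from collections import Counter

words = ['banana', 'Banana', 'tomato', 'tomato', 'kiwi']

# keep the lowercased words as a list, NOT a set, so duplicates survive
no_capitals = [w.lower() for w in words]

print(Counter(no_capitals).most_common(10))
# [('banana', 2), ('tomato', 2), ('kiwi', 1)]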

Solution 4:

Here is one solution. It uses sets, as discussed in the earlier responses.

def token_words(tokn=10, s1_orig='hello i must be going'):
    # tokn is the number of most common words.
    # s1_orig is the text blob that needs to be checked.
    #
    # Logic:
    # - clean the text: turn punctuation into spaces
    # - replace common machine-read errors
    # - make everything lower case
    # - map each cleaned word back to the original words it came from
    # - count how often each clean word occurs in the clean text
    # - sort and print the results

    # dictionary to turn punctuation into spaces
    punct_dict = {',': ' ',
                  '-': ' ',
                  '.': ' ',
                  '\n': ' ',
                  '\r': ' '}

    # dictionary for machine reading errors
    mach_dict = {'1': 'I', '0': 'O',
                 '6': 'b', '8': 'B'}

    # get rid of punctuation
    s1 = s1_orig
    for k, v in punct_dict.items():
        s1 = s1.replace(k, v)

    # create the original set of words
    orig_list = set(s1.split())

    # for each word in the original set, undo machine errors
    # and record which original words map to each clean word
    error_words = dict()
    for a_word in orig_list:
        a_w2 = a_word
        for k, v in mach_dict.items():
            a_w2 = a_w2.replace(k, v)

        # lower-case the result
        a_w2 = a_w2.lower()

        # add to the error-word dict
        try:
            error_words[a_w2].append(a_word)
        except KeyError:
            error_words[a_w2] = [a_word]

    # get rid of machine errors in the full text
    for k, v in mach_dict.items():
        s1 = s1.replace(k, v)

    # make everything lower case
    s1 = s1.lower()

    # split the text into a list of words
    s1_list = s1.split()

    # consider only unique words
    s1_set = set(s1_list)

    # count the number of times each word occurs in s1
    res_dict = dict()
    for a_word in s1_set:
        res_dict[a_word] = s1_list.count(a_word)

    # sort the result dictionary by value, descending
    print('--------------')
    temp = 0
    for key, value in sorted(res_dict.items(), reverse=True,
                             key=lambda kv: (kv[1], kv[0])):
        if temp < tokn:
            # print results for the top tokn items;
            # show all the original words that fed into this key
            final_key = '|'.join(error_words[key])
            print('%s@%s' % (final_key, value))
        temp = temp + 1

    # close the function and return
    return True


# main: read the inputs from the command line
num_tokens = input('Number of tokens desired: ')
raw_file = input('File name: ')

# read the file
try:
    if num_tokens == '':
        num_tokens = 10
    n_t = int(num_tokens)
    with open(raw_file, 'r') as f:
        raw_data = f.read()
    token_words(n_t, raw_data)
except (OSError, ValueError):
    print('Token or file error.  Please try again.')
