Counting Frequencies - python-3.x

I am trying to figure out how to sum the counts for the word tags I-GENE and O in a file.
Here is a sample of the file I'm working with:
45 WORDTAG O cortex
2 WORDTAG I-GENE cdc33
4 WORDTAG O PPRE
4 WORDTAG O How
44 WORDTAG O if
I am trying to compute the sum of words[0] (column 1) for each category, i.e. one total for I-GENE and one total for O.
In this example:
The sum for the category I-GENE is 2
and the sum for the category O is 97
MY CODE:
import os
def reading_files (path):
    counter = 0
    for root, dirs, files in os.walk(path):
        for file in files:
            if file != ".DS_Store":
                if file == "gene.counts":
                    open_file = open(root+file, 'r', encoding = "ISO-8859-1")
                    for line in open_file:
                        tmp = line.split(' ')
                        for words in tmp:
                            for word in words:
                                if (words[2]=='I-GENE'):
                                    sum = sum + int(words[0]
                                if (words[2] == 'O'):
                                    sum = sum + int(words[0])
                                else:
                                    print('Nothing')
    print(sum)

I think you should delete the word loop, since you don't use it:
for word in words:
I would use a dictionary for this if you want to solve it generally.
While you read the file, fill a dictionary:
- If the key is already in the dict, increase its value.
- If it is a new key, add it to the dict and set its value to the count from that line.
def reading_files (path):
    freqDict = dict()
    ...
    for line in open_file:
        words = line.split()  # [count, 'WORDTAG', tag, word]
        if words[2] not in freqDict:  # membership test; a dict is not callable
            freqDict[words[2]] = 0
        freqDict[words[2]] += int(words[0])
After you have created the dictionary, you can return it and look up values by key, or you can pass a key to the function and return or print just that value.
I prefer the first option: do as few file I/O operations as possible and reuse the collected data from memory.
For this solution I wrote a wrapper:
def getValue(fDict, key):
    if key not in fDict:
        return "Nothing"
    return str(fDict[key])
So it will behave like your example.
It is not necessary, but it is good practice to close the file when you are no longer using it.
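Putting the dictionary idea together, here is a minimal sketch. The function name count_tag_frequencies is hypothetical, and it assumes every line has the form "count WORDTAG tag word" as in the sample; the data is read from an in-memory string so the example runs on its own.

```python
import io
from collections import defaultdict


def count_tag_frequencies(lines):
    """Sum the first-column counts per tag (third column)."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        count, _, tag = parts[:3]
        totals[tag] += int(count)
    return dict(totals)


# Stand-in for the real file, using the sample from the question.
sample = io.StringIO(
    "45 WORDTAG O cortex\n"
    "2 WORDTAG I-GENE cdc33\n"
    "4 WORDTAG O PPRE\n"
    "4 WORDTAG O How\n"
    "44 WORDTAG O if\n"
)
totals = count_tag_frequencies(sample)
print(totals)  # {'O': 97, 'I-GENE': 2}
```

For a real file you could pass the file object instead, for example inside a with open(path, encoding="ISO-8859-1") block, which also takes care of closing the file.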

Related

Count the number of times a word is repeated in a text file

I need to write a program that prompts for the name of a text file and prints the words with the maximum and minimum frequency, along with their frequency (separated by a space).
This is my text
I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham
Code:
file = open(fname,'r')
dict1 = []
for line in file:
    line = line.lower()
    x = line.split(' ')
    if x in dict1:
        dict1[x] += 1
    else:
        dict1[x] = 1
Then I wanted to iterate over the keys and values to find out which words have the max and min frequency. However, at that point my console says:
TypeError: list indices must be integers or slices, not list
I don't know what that means either.
For this problem the expected result is:
Max frequency: i 5
Min frequency: you 1
You are using a list instead of a dictionary to store the word frequencies. A list cannot store key-value pairs like this; you need a dictionary. Also, line.split(' ') returns a list of words, and a list cannot be used as an index or a dictionary key, so you have to loop over the individual words. Here is how you could modify your code:
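file = open(fname,'r')
word_frequencies = {} # use a dictionary to store the word frequencies
for line in file:
    line = line.lower()
    # split() without arguments splits on any whitespace, so the trailing
    # newline does not stay attached to the last word of each line
    words = line.split()
    for word in words:
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1
file.close()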
Then iterate over the items to find the min and max frequency:
# iterate over the keys and values in the word_frequencies dictionary
# and find the word with the max and min frequency
max_word = None
min_word = None
max_frequency = 0
min_frequency = float('inf')
for word, frequency in word_frequencies.items():
    if frequency > max_frequency:
        max_word = word
        max_frequency = frequency
    if frequency < min_frequency:
        min_word = word
        min_frequency = frequency
Print the results:
print("Max frequency:", max_word, max_frequency)
print("Min frequency:", min_word, min_frequency)
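As an alternative, here is a short sketch of the same idea using collections.Counter from the standard library. It is not the approach from the answer above, and it reads the text from a string instead of a file so that it runs on its own. When several words tie, max() and min() return the first one encountered.

```python
from collections import Counter

text = """I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham
"""

# Lowercase everything and split on any whitespace (spaces and newlines).
counts = Counter(text.lower().split())

max_word, max_frequency = max(counts.items(), key=lambda item: item[1])
min_word, min_frequency = min(counts.items(), key=lambda item: item[1])

print("Max frequency:", max_word, max_frequency)  # Max frequency: i 5
print("Min frequency:", min_word, min_frequency)  # Min frequency: you 1
```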

Comparing keywords from one file to another file containing different tweet

I am currently doing a project for my Python course, which is sentiment analysis for tweets. We have only just finished reading/writing files and things like split() and strip() in class, so I am still a noob at programming.
The project involves two files, keywords.txt and tweets.txt. Samples of the files are:
sample of tweets.txt:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC
[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today
is going to be the greatest day of my life. Hired to take pictures at
my best friend's gparents 50th anniversary. 60 old people. Woo.
where the numbers in the brackets are coordinates, the numbers after that can be ignored, then the message/tweets comes after.
sample of keywords.txt:
alone,1
amazed,10
excited,10
love,10
where numbers represent the "sentimental value" of that keyword
What I am supposed to do is read both files in Python, separate the words in each tweet, and check whether any of the keywords appear in it. If they do, add together their sentimental values. Finally, print the total of the sentimental values for each tweet, ignoring tweets that do not contain any keywords.
So for example, the first tweet in the sample contains two keywords (excited and love), so its total sentimental value would be 20.
However, my code prints the sentimental values separately as 10, 10 rather than printing the total. I also have no idea how to make the keyword check iterate over every tweet.
My code so far:
tweets = open("tweets.txt","r")
keywords = open("keywords.txt","r")
def tweetDataExtract (infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(" ",5)
        return parts
def keywordsDataExtract (infile):
    line = infile.readline()
    if line == "":
        return[]
    else:
        parts = line.split(",",1)
        return parts
tweetData = tweetDataExtract(tweets)
while len (tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords=[]
    #gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    print(lat, long, messageWords)
    keywordsData = keywordsDataExtract(keywords)
    while len (keywordsData) == 2:
        words = keywordsData[0]
        happiness = int(keywordsData[1])
        keywordsData = keywordsDataExtract(keywords)
        count = 0
        sentiment = 0
        if words in messageWords:
            sentiment+=happiness
            count+=1
            print (lat, long, count, sentiment)
tweets.close()
keywords.close()
How can I fix the code?
PS I didn't know which part of the code would be essential to post, so I just posted the whole thing so far.
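The problem was that you initialised the variables count and sentiment inside the inner while loop, so they were reset to 0 for every keyword instead of once per tweet. In addition, keywords.txt was read to its end while processing the first tweet, so there was nothing left to compare against the later tweets. The corrected code reads the keywords into a list once and reuses it for every tweet.
Corrected code: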
tweets = open("tweets.txt","r")
keywords = open("keywords.txt","r")
def tweetDataExtract (infile):
    line = infile.readline()
    if line == "\n":
        # print("hello")
        return tweetDataExtract(infile)
    else:
        parts = line.split(" ",5)
        return parts
keywordsData = [line.split(',') for line in keywords]
tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords=[]
    #gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    count = 0
    sentiment = 0
    for i in range(0,len (keywordsData)):
        words = keywordsData[i][0]
        happiness = int(keywordsData[i][1].strip())
        if words in messageWords:
            sentiment+=happiness
            count+=1
    print (lat, long, count, sentiment)
tweets.close()
keywords.close()
See this new code (shorter and more pythonic):
import string
dic = {}
tweets = []
with open("tweets.txt",'r') as f:
    tweets = [line.strip() for line in f if line.strip() != '']
with open("keywords.txt",'r') as f:
    dic = {line.strip().split(',')[0]:line.strip().split(',')[1] for line in f if line.strip()!=''}
for t in tweets:
    t = t.split(" ",5)
    lat = float(t[0].strip("[,"))
    lon = float(t[1].rstrip("]"))
    sentiment = 0
    for word in t[5].translate(str.maketrans("","", string.punctuation)).lower().split():
        if word in dic:
            sentiment+=int(dic[word])
    print(lat,lon,sentiment)
Output:
41.29866963 -81.91532933 20
33.70290033 -117.95095704 0
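The task also asks to ignore tweets that contain no keywords, which the output above does not do (the second tweet is printed with 0). A minimal sketch of that last step follows, assuming the keywords are already in a dictionary and each message has already been separated from its coordinates. The helper name score_message is hypothetical. Like the shorter version above, it counts a keyword every time it occurs in a message.

```python
import string

# Assumed to have been loaded from keywords.txt beforehand.
keywords = {"alone": 1, "amazed": 10, "excited": 10, "love": 10}


def score_message(message, keywords):
    """Return (number of keyword hits, total sentiment) for one message."""
    cleaned = message.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.lower().split()
    matched = [word for word in words if word in keywords]
    return len(matched), sum(keywords[word] for word in matched)


messages = [
    "Work needs to fly by ... I'm so excited to see Spy Kids 4 "
    "with then love of my life ... ARREIC",
    "Today is going to be the greatest day of my life.",
]

for message in messages:
    count, sentiment = score_message(message, keywords)
    if count:  # skip tweets without any keyword
        print(count, sentiment)
```

With the two sample messages, only the first one is printed, with 2 keyword hits and a total of 20.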

dictionaries feature extraction Python

I'm doing a text categorization experiment. For the feature extraction phase I'm trying to create a feature dictionary per document. For now, I have two features: type-token ratio and n-grams of the relative frequency of function words. When I print my instances, only the type-token ratio feature is in the dictionary. This seems to be because of a malfunctioning get_pos(): it returns empty lists.
This is my code:
instances = []
labels = []
directory = "\\Users\OneDrive\Data"
for dname, dirs, files in os.walk(directory):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath,'r') as f:
            text = csv.reader(f, delimiter='\t')
            vector = {}
            #TTR
            lemmas = get_lemmas(text)
            unique_lem = set(lemmas)
            TTR = str(len(unique_lem) / len(lemmas))
            name = fname[:5]
            vector['TTR'+ '+' + name] = TTR
            #function word ngrams
            pos = get_pos(text)
            fw = []
            regex = re.compile(
                r'(LID)|(VNW)|(ADJ)|(TW)|(VZ)|(VG)|(BW)')
            for tag in pos:
                if regex.search(tag):
                    fw.append(tag)
            for n in [1,2,3]:
                grams = ngrams(fw, n)
                fdist = FreqDist(grams)
                total = sum(c for g,c in fdist.items())
                for gram, count in fdist.items():
                    vector['fw'+str(n)+'+'+' '+ name.join(gram)] = count/total
            instances.append(vector)
            labels.append(fname[:1])
print(instances)
And this is an example of a Dutch input file:
This is the code from the get_pos function, which I call from another script:
def get_pos(text):
    row4=[]
    pos = []
    for row in text:
        if not row:
            continue
        else:
            row4.append(row[4])
    pos = [x.split('(')[0] for x in row4] # remove what's between the brackets
    return pos
Can you help me find what's wrong with the get_pos function?
When you call get_lemmas(text), all contents of the file are consumed, so get_pos(text) has nothing left to iterate over. If you want to go through a file's content multiple times, you need to either f.seek(0) between the calls, or read the rows into a list in the beginning and iterate over the list when needed.
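A small self-contained sketch of that behaviour, using an in-memory file (io.StringIO) with made-up two-column data instead of the real input files:

```python
import csv
import io

# Stand-in for an opened tab-separated file.
f = io.StringIO("a\tb\nc\td\n")
reader = csv.reader(f, delimiter="\t")

first_pass = [row[0] for row in reader]   # consumes the whole file
second_pass = [row[0] for row in reader]  # nothing left to iterate over

# Option 1: rewind the underlying file before the second pass.
f.seek(0)
after_seek = [row[1] for row in reader]

# Option 2: read all rows into a list once and reuse the list.
f.seek(0)
rows = list(csv.reader(f, delimiter="\t"))
col0 = [row[0] for row in rows]
col1 = [row[1] for row in rows]

print(first_pass, second_pass, after_seek, col0, col1)
```

In the question's code, option 2 would mean building the list of rows right after creating the reader and passing that list to both get_lemmas and get_pos.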

Using the random function in Python for Evil Hangman

What I am trying to do is alter my original hangman game into what is called evil hangman. In order to do this, I need to first generate a random length of a word and pull out all words of that length from the original list.
Here is the code I am working with:
def setUp():
    """shows instructions, reads file,and returns a list of words from the english dictionary"""
    try:
        print(60*'*' +'''\n\t\tWelcome to Hangman!\n\t
I have selected a word from an english dictionary. \n\t
I will first show you the length of the secret word\n\t
as a series of dashes.\n\t
Your task is to guess the secret word one letter at a time.\n\t
If you guess a correct letter I will show you the guessed\n\t
letter(s) in the correct position.\n
You can only make 8 wrong guesses before you are hanged\n
\t\tGood luck\n''' + 60*'*')
        infile=open('dictionary.txt')
        l=infile.readlines()# list of words from which to choose
        infile.close()
        cleanList = []
        for word in l:
            cleanList.append(l[:-1])
        return(cleanList)
    except IOError:
        print('There was a problem loading the dictionary file as is.')
def sort_dict_words_by_length(words):
    """Given a list containing words of different length,
    sort those words based on their length."""
    d = defaultdict(list)
    for word in words:
        d[len(word)].append(word)
    return d
def pick_random_length_from_dictionary(diction):
    max_len, min_len = ( f(diction.keys()) for f in (max, min) )
    length = random.randint(min_len, max_len)
    return diction[length]
def playRound(w,g):
    """ It allows user to guess one letter. If right,places letter in correct positions in current guess string g, and shows current guess to user
    if not, increments w, number of wrongs. Returns current number of wrongs and current guess string"""
    print('You have ' + str(8 - w) + ' possible wrong guesses left.\n')
    newLetter = input('Please guess a letter of the secret word:\n')
    glist = list(g)#need to make changes to current guess string so need a mutable version of it
    if newLetter in secretWord:
        for j in range (0,len(secretWord)):
            if secretWord[j]==newLetter:
                glist[j] = newLetter
        g = ''.join(glist)#reassemble the guess as a string
        print('Your letter is indeed present in the secret word: ' + ' '.join(g)+'\n')
    else:
        w += 1
        print('Sorry, there are no ' + newLetter + ' in the secret word. Try again.\n')
    return(w,g)
def endRound(wr, w,l):
    """determines whether user guessed secret word, in which case updates s[0], or failed after w=8 attempts, in which case it updates s[1]"""
    if wr == 8:
        l += 1
        print('Sorry, you have lost this game.\n\nThe secret word was '+secretWord +'\n')#minor violation of encapsulation
    else:
        w +=1
        print(15*'*' + 'You got it!' + 15*'*')
    return(w,l)
def askIfMore():
    """ask user if s/he wants to play another round of the game"""
    while True:
        more = input('Would you like to play another round?(y/n)')
        if more[0].upper() == 'Y' or more[0].upper()=='N':
            return more[0].upper()
        else:
            continue
def printStats(w,l):
    """prints final statistics"""
    wGames='games'
    lGames = 'games'
    if w == 1:
        wGames = 'game'
    if l ==1:
        lGames = 'game'
    print('''Thank you for playing with us!\nYou have won {} {} and lost {} {}.\nGoodbye.'''.format(w,wGames,l,lGames))
try:
    import random
    from collections import defaultdict
    words=setUp()#list of words from which to choose
    won, lost = 0,0 #accumulators for games won, and lost
    while True:
        wrongs=0 # accumulator for wrong guesses
        secretWord = random.choice(words)[:-1]#eliminates '\n' at the end of each line
        print(secretWord) #for testing purposes
        guess= len(secretWord)*'_'
        print('Secret Word:' + ' '.join(guess))
        while wrongs < 8 and guess != secretWord:
            wrongs, guess = playRound(wrongs, guess)
        won, lost = endRound(wrongs,won,lost)
        if askIfMore()== 'N':
            break
    printStats(won, lost)
except:
    quit()
What I would like to do is generate a random number with the lower bound being the shortest length word and the upper bound being the highest length word, and then use that random number to create a new container with words of only that length, and finally returning that container to be used by the game further. I tried using min and max, but it seems to only return the first and last item of the list instead of showing the word with the most characters. Any help is appreciated.
If your 'dictionary.txt' has a single word on each line, you could use the following. It is fast because it goes over the list only once, but it uses roughly as much memory again as your original list.
from collections import defaultdict
import random
def sort_dict_words_by_length(words):
    """Given a list containing words of different length,
    sort those words based on their length."""
    d = defaultdict(list)
    for word in words:
        d[len(word)].append(word)
    return d
def pick_random_length_from_dictionary(diction):
    # Choose among the lengths that actually occur. random.randint(min_len, max_len)
    # could pick a length that no word has, which would return an empty list.
    length = random.choice(list(diction.keys()))
    return diction[length]
You would then pass the output from your setUp to sort_dict_words_by_length and that output to pick_random_length_from_dictionary.
If you are memory-limited, then you should first go over all words in the wordlist, keeping track of the minimal and maximal length of those words and then reiterate over that wordlist, appending only those words of the desired length. What you need for that is mentioned in the code above and just requires some code reshuffling. I'll leave that up to you as an exercise.
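For reference, one possible shape of that memory-limited variant is sketched below. The function name words_of_random_length is hypothetical, and it assumes one word per line. It deviates slightly from the description above: instead of only the minimum and maximum, the first pass remembers the set of lengths that occur, so the chosen length always has at least one word.

```python
import os
import random
import tempfile


def words_of_random_length(path):
    """Read the file twice; keep only the words of the chosen length in memory."""
    # First pass: remember which word lengths occur at all.
    lengths = set()
    with open(path) as infile:
        for line in infile:
            word = line.strip()
            if word:
                lengths.add(len(word))
    if not lengths:
        return []
    target = random.choice(sorted(lengths))
    # Second pass: collect only the words of the chosen length.
    with open(path) as infile:
        return [w for w in (line.strip() for line in infile) if len(w) == target]


# Small demonstration with a temporary word list.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nto\nbe\ncat\n")
    demo_path = tmp.name

chosen = words_of_random_length(demo_path)
os.remove(demo_path)
print(chosen)  # one of ['a'], ['to', 'be'] or ['cat']
```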

Selecting a random value from a dictionary in python

Here is my function:
def evilSetup():
    words = setUp()
    result = {}
    char = input('Please enter your one letter guess: ')
    for word in words:
        key = ' '.join(char if c == char else '-' for c in word)
        if key not in result:
            result[key] = []
        result[key].append(word)
    return max(result.items(), key=lambda keyValue: len(keyValue[1]))
from collections import defaultdict
import random
words= evilSetup()#list of words from which to choose
won, lost = 0,0 #accumulators for games won, and lost
while True:
    wrongs=0 # accumulator for wrong guesses
    secretWord = words
    print(secretWord) #for testing purposes
    guess= len(secretWord)*'_'
    print('Secret Word:' + ' '.join(guess))
    while wrongs < 8 and guess != secretWord:
        wrongs, guess = playRound(wrongs, guess)
    won, lost = endRound(wrongs,won,lost)
    if askIfMore()== 'N':
        break
printStats(won, lost)
The function will take a list of words, and sort them into a dictionary based on the position of the guessed letter. As of now, it returns the key,value pair that is the largest. What I would like it to ultimately return is a random word from the biggest dictionary entry. The values as of now are in the form of a list.
For example {- - -, ['aah', 'aal', 'aas']}
Ideally I would grab a random word from this list to return. Any help is appreciated. Thanks in advance.
If you have a list lst, then you can simply do:
random_word = random.choice(lst)
to get a random entry of the list. So here, you will want something like:
return random.choice(max(result.items(), key=lambda kv: len(kv[1]))[1])
#      ^^^^^^^^^^^^^^                                              ^^^^
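To see the whole thing in one place, here is a small sketch that takes the word list and the guessed letter as parameters instead of calling setUp() and input(). The function name pick_from_largest_family and the word list are made up for illustration.

```python
import random


def pick_from_largest_family(words, char):
    """Group words by the pattern of `char`, then pick a random word
    from the largest group."""
    families = {}
    for word in words:
        key = ' '.join(char if c == char else '-' for c in word)
        families.setdefault(key, []).append(word)
    # max() returns the (pattern, words) pair with the longest word list.
    pattern, family = max(families.items(), key=lambda kv: len(kv[1]))
    return pattern, random.choice(family)


pattern, word = pick_from_largest_family(
    ['aah', 'aal', 'aas', 'bee', 'ebb'], 'e')
print(pattern, word)  # pattern is '- - -'; word is one of aah, aal, aas
```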

Resources