dictionaries feature extraction Python - python-3.x

I'm doing a text categorization experiment. For the feature extraction phase I'm trying to create a feature dictionary per document. For now I have two features: type token ratio, and n-grams of the relative frequency of function words. When I print my instances, only the type token ratio feature is in the dictionary. This seems to be caused by a malfunctioning get_pos(): it returns empty lists.
This is my code:
instances = []
labels = []
directory = "\\Users\OneDrive\Data"
for dname, dirs, files in os.walk(directory):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath, 'r') as f:
            text = csv.reader(f, delimiter='\t')
            vector = {}
            # TTR
            lemmas = get_lemmas(text)
            unique_lem = set(lemmas)
            TTR = str(len(unique_lem) / len(lemmas))
            name = fname[:5]
            vector['TTR' + '+' + name] = TTR
            # function word ngrams
            pos = get_pos(text)
            fw = []
            regex = re.compile(
                r'(LID)|(VNW)|(ADJ)|(TW)|(VZ)|(VG)|(BW)')
            for tag in pos:
                if regex.search(tag):
                    fw.append(tag)
            for n in [1, 2, 3]:
                grams = ngrams(fw, n)
                fdist = FreqDist(grams)
                total = sum(c for g, c in fdist.items())
                for gram, count in fdist.items():
                    vector['fw' + str(n) + '+' + ' ' + name.join(gram)] = count / total
        instances.append(vector)
        labels.append(fname[:1])
print(instances)
This is the code from the get_pos function, which I call from another script:
def get_pos(text):
    row4 = []
    pos = []
    for row in text:
        if not row:
            continue
        else:
            row4.append(row[4])
    pos = [x.split('(')[0] for x in row4]  # remove what's between the brackets
    return pos
Can you help me find what's wrong with the get_pos function?

When you call get_lemmas(text), all contents of the file are consumed, so get_pos(text) has nothing left to iterate over. If you want to go through a file's content multiple times, you need to either f.seek(0) between the calls, or read the rows into a list in the beginning and iterate over the list when needed.
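For example, a minimal sketch of the second option, reading the rows into a list once (reusing fpath and the csv.reader from the question):
with open(fpath, 'r') as f:
    rows = list(csv.reader(f, delimiter='\t'))  # materialize all rows once
lemmas = get_lemmas(rows)  # a list can be iterated any number of times...
pos = get_pos(rows)        # ...so this no longer comes back empty
And the first option, rewinding the file and building a fresh reader between the calls:
with open(fpath, 'r') as f:
    lemmas = get_lemmas(csv.reader(f, delimiter='\t'))
    f.seek(0)  # rewind the underlying file object
    pos = get_pos(csv.reader(f, delimiter='\t'))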

Related

What is the best way to remove rare words from a large text?

The dataset contains 2.14M words
The following is my code.
uni = get_unique(ds)  # to get all unique words
c = Counter(uni)  # using Counter from collections to create a dictionary
v = list(c.values())  # dict values
ky = list(c.keys())  # dict keys
junk = []  # indexes of rare words (words that appear fewer than 20 times)
num = 0  # the number of words that appear 20 times or more
for i in range(len(v)):
    if v[i] >= 20:
        num += 1
    else:
        junk.append(i)
rare_words = []
for i in junk:
    rare_words.append(ky[i])  # selecting rare words from the keys
A function to remove the rare words
def remove_jnk(dataset, rare_words):
    ds = []
    for i in dataset:
        repl_wrd = " "
        res = " ".join([repl_wrd if idx in rare_words else idx for idx in i[0].split()])
        ds.append([res])
    return ds

ds = remove_jnk(ds, rare_words)
This is too slow; it takes hours to run.
Maybe try importing a library such as NLTK and doing something like:
import nltk
tokens = [] #your word list
freq_dist = nltk.FreqDist(tokens)
FreqDist() gives the frequency distribution of the terms in the corpus; from it you can select the rarest ones into a list. In Python 3, dict views can't be sliced, so use most_common(), whose tail holds the least frequent terms:
rarewords = [w for w, c in freq_dist.most_common()[-5:]]  # the five rarest terms
after_rare_words = [word for word in tokens if word not in rarewords]
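Beyond that, the main reason the original code takes hours is the membership test idx in rare_words against a list, which is linear per token; a set makes it constant time. (Note also that Counter(uni) counts each unique word once, so every value is 1; the counts presumably should come from the full token stream.) A minimal sketch under the question's assumptions, i.e. dataset rows of the form [text] and a threshold of 20 occurrences:
from collections import Counter

def remove_rare(dataset, min_count=20):
    # count over the full token stream, not over the unique words
    counts = Counter(word for row in dataset for word in row[0].split())
    rare = {w for w, c in counts.items() if c < min_count}  # set: O(1) lookups
    return [[" ".join(" " if w in rare else w for w in row[0].split())]
            for row in dataset]

ds = remove_rare(ds, min_count=20)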

How do I count all occurrences of a phrase in a text file using regular expressions?

I am reading in multiple files from a directory and attempting to find how many times a specific phrase (in this instance "at least") occurs in each file (not just whether it occurs, but how many times per file). My code is as follows:
import glob
import os

path = 'D:/Test'
k = 0
for filename in glob.glob(os.path.join(path, '*.txt')):
    if filename.endswith('.txt'):
        f = open(filename)
        data = f.read()
        data.split()
        data.lower()
        S = re.findall(r' at least ', data, re.MULTILINE)
        count = []
        if S == True:
            for S in data:
                count.append(data.count(S))
            k = k + 1
            print("'{}' match".format(filename), count)
        else:
            print("'{}' no match".format(filename))
print("Total number of matches", k)
At the moment I get no matches at all. I can detect whether or not the phrase occurs, but I am not sure why I can't get a count of all occurrences in each text file.
Any help would be appreciated.
Regards
You can get rid of the regex entirely; the count method of string objects is enough, and much of the other code can be simplified as well.
You're also not actually lower-casing data: data.lower() returns a new string that you never assign, so the original is unchanged. Note how I use data = data.lower() to actually change the variable.
Try this code:
import glob
import os

path = r'c:\script\lab\Tests'  # raw string, so the backslashes are literal
k = 0
substring = ' at least '
for filename in glob.glob(os.path.join(path, '*.txt')):
    if filename.endswith('.txt'):
        with open(filename) as f:  # closes the file automatically
            data = f.read()
        data = data.lower()
        S = data.count(substring)
        if S:
            k = k + 1
            print("'{}' match".format(filename), S)
        else:
            print("'{}' no match".format(filename))
print("Total number of matches", k)
If anything is unclear feel free to ask!
You make multiple mistakes in your code. data.split() and data.lower() have no effect at all, since they both return a modified version rather than modifying data in place; you don't assign the return value to anything, so it is lost.
Also, you should always close a resource (e.g. a file) when you don't need it anymore.
Also, re.findall already returns a list containing every occurrence of the string you are looking for, so there is no need to loop and append to another list; that list would just contain the search string x times. You can simply compute the length of the list returned by re.findall: this gives you the number of times the phrase occurs in the text. Then you increase your counter variable k by that amount and move on to the next file. You can still have your print statements by simply printing the temporary num_found variable.
import re
import glob
import os

path = 'D:/Test'
k = 0
for filename in glob.glob(os.path.join(path, '*.txt')):
    if filename.endswith('.txt'):
        f = open(filename)
        text = f.read()
        f.close()
        num_found = len(re.findall(r' at least ', text, re.MULTILINE))
        print("'{}' matches:".format(filename), num_found)
        k += num_found
print("Total number of matches", k)
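A caveat to both answers (an addition, not from the original posts): the space-padded pattern ' at least ' misses occurrences at the start or end of a line or next to punctuation. If the regex is kept, a word-boundary pattern is more robust; a minimal sketch:
import re

text = "At least once. We need at least two, or at least-ish."
num_found = len(re.findall(r'\bat least\b', text, re.IGNORECASE))
print(num_found)  # 3 - boundaries also match at punctuation and line edges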

TypeError: 'int' object is not iterable when calculating mean

I am trying to read different values from a file and store them in lists. After that, I need to take their mean, and in doing so I am getting the error above. The code works up to the line
Avg_Humidity.append(words[8])
Here it is:
def monthly_report(path, year, month):
    pre_script = "Murree_weather"
    format = '.txt'
    file_name = pre_script + year + month + format
    name_path = os.path.join(path, file_name)
    file = open(name_path, 'r')
    data = file.readlines()
    Max_Temp = []
    Min_Temp = []
    Avg_Humidity = []
    for line in data:
        words = line.split(",")
        Max_Temp.append(words[1])
        Min_Temp.append(words[3])
        Avg_Humidity.append(words[8])
    Count_H, Count_Max_Temp, Count_Min_Temp, Mean_Max_Temp, Mean_Min_Temp, Mean_Avg_Humidity = 0
    for iterate in range(1, len(Max_Temp)):
        Mean_Max_Temp = Mean_Max_Temp + Max_Temp(iterate)
        Count_Max_Temp = Count_Max_Temp + 1
    Mean_Max_Temp = Mean_Max_Temp / Count_Max_Temp
    for iterate in range(1, len(Min_Temp)):
        Mean_Min_Temp = Mean_Min_Temp + Min_Temp(iterate)
        Count_Min_Temp = Count_Min_Temp + 1
    Mean_Min_Temp = Mean_Min_Temp / Count_Min_Temp
    for iterate in range(1, len(Avg_Humidity)):
        Mean_Avg_Humidity = Mean_Avg_Humidity + Avg_Humidity(iterate)
        Count_H = Count_H + 1
    Mean_Avg_Humidity = Mean_Avg_Humidity / Count_H
    print("Mean Average Humidity = ", Mean_Avg_Humidity)
    print("Mean Maximum Temperature = ", Mean_Max_Temp)
    print("Mean Minimum Temperature = ", Mean_Min_Temp)
    return
This line is incorrect:
Count_H, Count_Max_Temp, Count_Min_Temp, Mean_Max_Temp, Mean_Min_Temp, Mean_Avg_Humidity = 0
To fix, change it to:
Count_H = Count_Max_Temp = Count_Min_Temp = Mean_Max_Temp = Mean_Min_Temp = Mean_Avg_Humidity = 0
An alternative fix would be to leave the commas as they are and change the right-hand side to a list or tuple of zeroes that has the same number of elements as the left-hand side. But that would be less clear, and harder to maintain.
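For illustration, that alternative would look like this, with a tuple of six zeros unpacked across the six names:
Count_H, Count_Max_Temp, Count_Min_Temp, Mean_Max_Temp, Mean_Min_Temp, Mean_Avg_Humidity = (0,) * 6
Note that two more problems will surface right after this fix: lists are indexed with square brackets (Max_Temp[iterate], not Max_Temp(iterate)), and the appended values are strings, so they need float() before any arithmetic.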

Comparing keywords from one file to another file containing different tweets

I am currently doing a project for my Python course: sentiment analysis of tweets. We have only just covered reading/writing files and things like split() and strip() in class, so I am still a noob at programming.
The project involves two files, keywords.txt and tweets.txt. Samples of the files:
Sample of tweets.txt:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC
[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today
is going to be the greatest day of my life. Hired to take pictures at
my best friend's gparents 50th anniversary. 60 old people. Woo.
where the numbers in the brackets are coordinates, the numbers after that can be ignored, and the message/tweet comes after.
Sample of keywords.txt:
alone,1
amazed,10
excited,10
love,10
where the numbers represent the "sentiment value" of that keyword.
What I am supposed to do is read both files in Python, split each tweet into words, and check whether any of the keywords appear in it; if they do, add their sentiment values together. Finally, print the total sentiment value of each tweet, ignoring the tweets that contain no keywords at all.
So for the first tweet in the sample, two keywords appear (excited and love), so the total sentiment value would be 20.
However, my code prints the sentiment values separately as 10, 10, rather than printing the total. I also have no idea how to make the keyword-checking part iterate over every tweet.
My code so far:
tweets = open("tweets.txt", "r")
keywords = open("keywords.txt", "r")

def tweetDataExtract(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(" ", 5)
        return parts

def keywordsDataExtract(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(",", 1)
        return parts

tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    # gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    print(lat, long, messageWords)
    keywordsData = keywordsDataExtract(keywords)
    while len(keywordsData) == 2:
        words = keywordsData[0]
        happiness = int(keywordsData[1])
        keywordsData = keywordsDataExtract(keywords)
        count = 0
        sentiment = 0
        if words in messageWords:
            sentiment += happiness
            count += 1
            print(lat, long, count, sentiment)

tweets.close()
keywords.close()
How can I fix the code?
PS I didn't know which part of the code would be essential to post, so I just posted the whole thing so far.
The problem was that you had initialised the variables count and sentiment inside the while loop itself. I hope you realise its consequences!!
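To spell the consequence out, here is a minimal illustration of why a total that is reset inside the loop can never accumulate:
for value in [10, 10]:
    total = 0        # bug: initialised inside the loop, so it resets every pass
    total += value
    print(total)     # prints 10 then 10, the exact symptom in the question
# initialise total = 0 before the loop instead, and the final value is 20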
Corrected code:
tweets = open("tweets.txt", "r")
keywords = open("keywords.txt", "r")

def tweetDataExtract(infile):
    line = infile.readline()
    if line == "\n":
        # print("hello")
        return tweetDataExtract(infile)
    else:
        parts = line.split(" ", 5)
        return parts

keywordsData = [line.split(',') for line in keywords]
tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    # gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    count = 0
    sentiment = 0
    for i in range(0, len(keywordsData)):
        words = keywordsData[i][0]
        happiness = int(keywordsData[i][1].strip())
        if words in messageWords:
            sentiment += happiness
            count += 1
    print(lat, long, count, sentiment)

tweets.close()
keywords.close()
See this new code (shorter and more Pythonic):
import string

dic = {}
tweets = []
with open("tweets.txt", 'r') as f:
    tweets = [line.strip() for line in f if line.strip() != '']
with open("keywords.txt", 'r') as f:
    dic = {line.strip().split(',')[0]: line.strip().split(',')[1] for line in f if line.strip() != ''}
for t in tweets:
    t = t.split(" ", 5)
    lat = float(t[0].strip("[,"))
    lon = float(t[1].rstrip("]"))
    sentiment = 0
    for word in t[5].translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in dic:
            sentiment += int(dic[word])
    print(lat, lon, sentiment)
Output:
41.29866963 -81.91532933 20
33.70290033 -117.95095704 0

Counting Frequencies

I am trying to figure out how to count how many times the word tags I-GENE and O appear in a file.
An example of the file I'm processing:
45 WORDTAG O cortex
2 WORDTAG I-GENE cdc33
4 WORDTAG O PPRE
4 WORDTAG O How
44 WORDTAG O if
I am trying to compute the sum of words[0] (column 1) grouped by category (e.g. I-GENE, or O).
In this example:
the sum of the counts with category I-GENE is 2,
and the sum of the counts with category O is 97 (45 + 4 + 4 + 44).
MY CODE:
import os

def reading_files(path):
    counter = 0
    for root, dirs, files in os.walk(path):
        for file in files:
            if file != ".DS_Store":
                if file == "gene.counts":
                    open_file = open(root + file, 'r', encoding="ISO-8859-1")
                    for line in open_file:
                        tmp = line.split(' ')
                        for words in tmp:
                            for word in words:
                                if words[2] == 'I-GENE':
                                    sum = sum + int(words[0])
                                if words[2] == 'O':
                                    sum = sum + int(words[0])
                                else:
                                    print('Nothing')
    print(sum)
I think you should delete the word loop - you don't use it
for word in words:
I would use a dictionary for this, if you want to solve it generally.
While you read the file, fill a dictionary:
- if the key is already in the dict, increase its value;
- if it is a new key, add it to the dict with an initial value of 0 (the row's count is added right after), as the snippet below shows.
def reading_files(path):
    freqDict = dict()
    ...
    for words in tmp:
        if words[2] not in freqDict:
            freqDict[words[2]] = 0
        freqDict[words[2]] += int(words[0])
After you have created the dictionary, you can return it and look up whatever keyword you need, or you can pass a keyword into the function and return (or print) just that value.
I prefer the first one: use as few file IO operations as possible, and reuse the collected data from memory.
For this solution I wrote a wrapper:
def getValue(fDict, key):
    if key not in fDict:
        return "Nothing"
    return str(fDict[key])
So it will behave like your example.
It is not necessary, but it is good practice to close the file when you are not using it anymore (or open it with a with statement, which closes it for you).
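Putting the pieces together, here is a minimal self-contained sketch of the dictionary approach (the file layout is assumed from the sample above, collections.defaultdict replaces the explicit key check, and the directory name is hypothetical):
import os
from collections import defaultdict

def reading_files(path):
    freqDict = defaultdict(int)  # missing keys start at 0 automatically
    for root, dirs, files in os.walk(path):
        for file in files:
            if file == "gene.counts":
                with open(os.path.join(root, file), 'r', encoding="ISO-8859-1") as open_file:
                    for line in open_file:
                        words = line.split()  # e.g. ['45', 'WORDTAG', 'O', 'cortex']
                        freqDict[words[2]] += int(words[0])
    return freqDict

freqDict = reading_files("counts/")  # hypothetical directory
print(getValue(freqDict, 'I-GENE'))  # '2' for the sample above
print(getValue(freqDict, 'O'))       # '97'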
