Comparing keywords from one file to another file containing different tweets - python-3.x

I am currently doing a project for my Python course which is sentiment analysis for tweets. We have only just finished reading/writing files and things like split() and strip() in our Python class, so I am still a noob at programming.
The project involves two files, keywords.txt and tweets.txt. Samples of the files are:
sample of tweets.txt:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC
[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.
where the numbers in the brackets are coordinates, the numbers after that can be ignored, and the message/tweet comes after.
sample of keywords.txt:
alone,1
amazed,10
excited,10
love,10
where the numbers represent the "sentimental value" of that keyword.
What I am supposed to do is read both files in Python, separate the words in each message/tweet, and check whether any of the keywords appear in each tweet; if a keyword appears in the tweet, add its sentimental value to a total. Finally, print the total sentimental value of each tweet, ignoring tweets that do not contain any keywords.
So, for example, the first tweet in the sample contains two keywords (excited and love), so its total sentimental value would be 20.
However, my code prints the sentimental values separately as 10, 10, rather than printing the total. I also have no idea how to make the function that checks keywords iterate over every tweet.
My code so far:
tweets = open("tweets.txt","r")
keywords = open("keywords.txt","r")

def tweetDataExtract (infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(" ",5)
        return parts

def keywordsDataExtract (infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(",",1)
        return parts

tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    #gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    print(lat, long, messageWords)

    keywordsData = keywordsDataExtract(keywords)
    while len(keywordsData) == 2:
        words = keywordsData[0]
        happiness = int(keywordsData[1])
        keywordsData = keywordsDataExtract(keywords)
        count = 0
        sentiment = 0
        if words in messageWords:
            sentiment += happiness
            count += 1
        print(lat, long, count, sentiment)

tweets.close()
keywords.close()
How can I fix the code?
PS I didn't know which part of the code would be essential to post, so I just posted the whole thing so far.

The problem was that you had initialised the variables count and sentiment inside the inner while loop itself. I hope you realise its consequences! (The keywords file is also read into a list up front below, so the keyword list can be scanned again for every tweet.)
Corrected code:
tweets = open("tweets.txt","r")
keywords = open("keywords.txt","r")

def tweetDataExtract (infile):
    line = infile.readline()
    if line == "\n":
        # print("hello")
        return tweetDataExtract(infile)
    else:
        parts = line.split(" ",5)
        return parts

keywordsData = [line.split(',') for line in keywords]

tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    #gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    count = 0
    sentiment = 0
    for i in range(0, len(keywordsData)):
        words = keywordsData[i][0]
        happiness = int(keywordsData[i][1].strip())
        if words in messageWords:
            sentiment += happiness
            count += 1
    print(lat, long, count, sentiment)

tweets.close()
keywords.close()
See this new code (shorter and more Pythonic):
import string

dic = {}
tweets = []
with open("tweets.txt",'r') as f:
    tweets = [line.strip() for line in f if line.strip() != '']
with open("keywords.txt",'r') as f:
    dic = {line.strip().split(',')[0]: line.strip().split(',')[1] for line in f if line.strip() != ''}

for t in tweets:
    t = t.split(" ",5)
    lat = float(t[0].strip("[,"))
    lon = float(t[1].rstrip("]"))
    sentiment = 0
    for word in t[5].translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in dic:
            sentiment += int(dic[word])
    print(lat, lon, sentiment)
Output:
41.29866963 -81.91532933 20
33.70290033 -117.95095704 0
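If tweets that contain no keyword should be skipped entirely, as the assignment describes, the final print inside the loop above can be guarded; a small tweak, assuming all sentiment values are positive as in the sample keywords.txt:
    if sentiment > 0:  # only report tweets that matched at least one keyword
        print(lat, lon, sentiment)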

Related

Count the number of times a word is repeated in a text file

I need to write a program that prompts for the name of a text file and prints the words with the maximum and minimum frequency, along with their frequency (separated by a space).
This is my text:
I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham
Code:
file = open(fname,'r')
dict1 = []
for line in file:
    line = line.lower()
    x = line.split(' ')
    if x in dict1:
        dict1[x] += 1
    else:
        dict1[x] = 1
Then I wanted to iterate over the keys and values and find out which ones have the max and min frequency; however, at that point my console says:
TypeError: list indices must be integers or slices, not list
I don't know what that means either.
For this problem the expected result is:
Max frequency: i 5
Min frequency: you 1
You are using a list instead of a dictionary to store the word frequencies. A list can't store key-value pairs like this; you need a dictionary. Here is how you could modify your code:
file = open(fname,'r')
word_frequencies = {}  # use a dictionary to store the word frequencies
for line in file:
    line = line.lower()
    words = line.split(' ')
    for word in words:
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1
Then iterate over the keys and values to find the min and max frequency:
# iterate over the keys and values in the word_frequencies dictionary
# and find the word with the max and min frequency
max_word = None
min_word = None
max_frequency = 0
min_frequency = float('inf')
for word, frequency in word_frequencies.items():
    if frequency > max_frequency:
        max_word = word
        max_frequency = frequency
    if frequency < min_frequency:
        min_word = word
        min_frequency = frequency
Print the results:
print("Max frequency:", max_word, max_frequency)
print("Min frequency:", min_word, min_frequency)

What is the best way to remove rare words from a large text?

The dataset contains 2.14M words
The following is my code.
uni = get_unique(ds)  # to get all unique words
c = Counter(uni)      # using Counter from collections to create a dictionary
v = list(c.values())  # dict values
ky = list(c.keys())   # dict keys
junk = []             # indexes of rare words (words that appear less than 20 times)
num = 0               # the number of words that appear more than 20 times
for i in range(len(v)):
    if v[i] >= 20:
        num += 1
    else:
        junk.append(i)

rare_words = []
for i in junk:
    rare_words.append(ky[i])  # selecting rare words from the keys
A function to remove the rare words:
def remove_jnk(dataset, rare_words):
    ds = []
    for i in dataset:
        repl_wrd = " "
        res = " ".join([repl_wrd if idx in rare_words else idx for idx in i[0].split()])
        ds.append([res])
    return ds

ds = remove_jnk(ds, rare_words)
This is too slow; it's taking hours to run.
Maybe try importing a library such as NLTK and doing something like:
import nltk

tokens = []  # your word list
freq_dist = nltk.FreqDist(tokens)
FreqDist() gives the distribution of the terms in the corpus; the rarest ones can then be collected into a list and filtered out:
rarewords = [w for w, _ in freq_dist.most_common()[-5:]]
after_rare_words = [word for word in tokens if word not in rarewords]
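On a 2.14M-word dataset, the main cost in the original remove_jnk is the idx in rare_words check, which scans a Python list for every token. A minimal sketch of a faster variant (a hypothetical remove_rare helper, assuming dataset has the same shape as in the question, a list of one-element lists of text): count all tokens once with Counter and keep the frequent words in a set so membership tests are constant time.
from collections import Counter

def remove_rare(dataset, min_count=20):
    # count every token across the whole dataset (not just the unique ones)
    counts = Counter(word for row in dataset for word in row[0].split())
    keep = {word for word, c in counts.items() if c >= min_count}  # set gives O(1) lookups
    # replace rare words with a space, mirroring the original remove_jnk
    return [[" ".join(w if w in keep else " " for w in row[0].split())] for row in dataset]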

Find anagrams of a given sentence from a list of words

I have a sentence with no spaces and only lowercase letters, for example:
"johndrinksmilk"
and a list of words, which contains only words that could be anagrams of the sentence above, also these words are in alphabetical order, for example:
["drink","drinks","john","milk","milks"]
I want to create a function (without using libraries) which returns a tuple of three words that together form an anagram of the given sentence. This tuple has to be the last possible anagram of the sentence. If the words in the given list can't be used to form the given sentence, the function should return None. Since I know I'm very bad at explaining things, I'll try to give you some examples:
For example, with:
sentence = "johndrinksmilk"
g_list = ["drink","drinks","john","milk","milks"]
the result should be:
r_result = ("milks","john","drink")
while these results should be wrong:
w_result = ("drinks","john","milk")
w_result = None
w_result = ("drink","john","milks")
I tried this:
def find_anagram(sentence, g_list):
    g_list.reverse()
    for fword in g_list:
        if g_list.index(fword) == len(g_list)-1:
            break
        for i in range(len(fword)):
            sentence_1 = sentence.replace(fword[i],"",1)
        if sentence_1 == "":
            break
        count2 = g_list.index(fword)+1
        for sword in g_list[count2:]:
            if g_list.index(sword) == len(g_list)-1:
                break
            for i in range(len(sword)):
                if sword.count(sword[i]) > sentence_1.count(sword[i]):
                    break
                else:
                    sentence_2 = sentence_1.replace(sword[i],"",1)
            count3 = g_list.index(sword)+1
            if sentence_2 == "":
                break
            for tword in g_list[count3:]:
                for i in range(len(tword)):
                    if tword.count(tword[i]) != sentence_2.count(tword[i]):
                        break
                else:
                    return (fword,sword,tword)
    return None
but instead of returning:
("milks","john","drink")
it returns:
None
Can anyone please tell me what's wrong? If you think my function is bad feel free to show me a different approach (but still without using libraries), because I have the feeling my function is both complex and very slow (and wrong of course...).
Thanks for your time.
Edit: new examples as requested.
sentence = "markeatsbread"
a_list = ["bread","daerb","eats","kram","mark","stae"] #these are all the possible anagrams
the correct result is:
result = ["stae","mark","daerb"]
wrong results should be:
result = ["mark","eats","bread"] #this could be a possible anagram, but I need the last possible one
result = None #can't return None because there's at least one anagram
Try this and see if it works with all of your cases:
def findAnagram(sentence, word_list):
    word_list.reverse()
    for f_word in word_list:
        if word_list[-1] == f_word:
            break
        index1 = word_list.index(f_word) + 1
        for s_word in word_list[index1:]:
            if word_list[-1] == s_word:
                break
            index2 = word_list.index(s_word) + 1
            for t_word in word_list[index2:]:
                if (sorted(list(f_word + s_word + t_word)) == sorted(list(sentence))):
                    return (f_word, s_word, t_word)
Hopefully this helps you
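A quick sanity check against the example from the question (reversing the list is what makes the search hit the "last" combination first); this should print the expected tuple:
sentence = "johndrinksmilk"
g_list = ["drink", "drinks", "john", "milk", "milks"]
print(findAnagram(sentence, g_list))   # ('milks', 'john', 'drink')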

python sorting from large to small

text1 = 'We are not what we should be We are not what we need to be But at least we are not what we used to be -- Football Coach'
I have this text1, and here is the code for it:
def word_dict(all_words):
    word_dict = {}
    # YOUR CODE HERE
    text = text1.split()
    for line in text:
        words = line.split()
        for word in words:
            word = word.lower()
            if not word in word_dict:
                word_dict[word] = 1
            else:
                word_dict[word] = word_dict[word] + 1
    return word_dict
I am counting each and every word in text1, but when displaying I want to display from the largest to the smallest number of counts for each word.
You can use sorted to sort the words by length.
textArr = text1.split()
result = sorted(textArr, key=len)
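If the goal is to order by how often each word occurs, as the question describes, rather than by word length, the dictionary returned by word_dict can be sorted by its values instead; a small sketch assuming that dictionary:
counts = word_dict(text1)
for word, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)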

dictionaries feature extraction Python

I'm doing a text categorization experiment. For the feature extraction phase I'm trying to create a feature dictionary per document. For now, I have two features: type-token ratio and n-grams of the relative frequency of function words. When I print my instances, only the type-token ratio feature is in the dictionary. This seems to be caused by a malfunctioning get_pos(), which returns empty lists.
This is my code:
instances = []
labels = []
directory = "\\Users\OneDrive\Data"
for dname, dirs, files in os.walk(directory):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath,'r') as f:
            text = csv.reader(f, delimiter='\t')
            vector = {}
            #TTR
            lemmas = get_lemmas(text)
            unique_lem = set(lemmas)
            TTR = str(len(unique_lem) / len(lemmas))
            name = fname[:5]
            vector['TTR' + '+' + name] = TTR
            #function word ngrams
            pos = get_pos(text)
            fw = []
            regex = re.compile(
                r'(LID)|(VNW)|(ADJ)|(TW)|(VZ)|(VG)|(BW)')
            for tag in pos:
                if regex.search(tag):
                    fw.append(tag)
            for n in [1,2,3]:
                grams = ngrams(fw, n)
                fdist = FreqDist(grams)
                total = sum(c for g,c in fdist.items())
                for gram, count in fdist.items():
                    vector['fw'+str(n)+'+'+' '+ name.join(gram)] = count/total
            instances.append(vector)
            labels.append(fname[:1])
print(instances)
And this is an example of a Dutch input file:
This is the code from the get_pos function, which I call from another script:
def get_pos(text):
    row4 = []
    pos = []
    for row in text:
        if not row:
            continue
        else:
            row4.append(row[4])
    pos = [x.split('(')[0] for x in row4]  # remove what's between the brackets
    return pos
Can you help me find what's wrong with the get_pos function?
When you call get_lemmas(text), all contents of the file are consumed, so get_pos(text) has nothing left to iterate over. If you want to go through a file's content multiple times, you need to either f.seek(0) between the calls, or read the rows into a list in the beginning and iterate over the list when needed.
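A minimal sketch of the second option, using names from the question (fpath, get_lemmas, get_pos): materialise the csv rows as a list once, then let both helpers iterate over that list.
import csv

with open(fpath, 'r') as f:                      # fpath as built in the question's loop
    rows = list(csv.reader(f, delimiter='\t'))   # read all rows into memory once

lemmas = get_lemmas(rows)   # both calls now iterate over the same in-memory list
pos = get_pos(rows)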
