python sorting from large to small - python-3.x

text1 = 'We are not what we should be We are not what we need to be But at least we are not what we used to be -- Football Coach'
I have this text1 and here is the code for it:
def word_dict(all_words):
    word_dict = {}
    # YOUR CODE HERE
    text = text1.split()
    for line in text:
        words = line.split()
        for word in words:
            word = word.lower()
            if word not in word_dict:
                word_dict[word] = 1
            else:
                word_dict[word] = word_dict[word] + 1
    return word_dict
I am counting each and every word in text1, but when I display the results I want to display them from the largest to the smallest word count.

You can use sorted to sort the words, for example by length:
textArr = text1.split()
result = sorted(textArr, key=len)
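If what you actually want is the words ordered by their counts, you can sort the items of the dictionary returned by word_dict, using the count as the sort key; a minimal sketch building on the code above:
counts = word_dict(text1)
# sort the (word, count) pairs by count, largest first
for word, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(word, count)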

Related

Timecode Tags Calculation

I'm a beginner in Python and I have a small question. There is a plain text with timing tags (hours:minutes:seconds). 'Some text 1' and 'Some text 3' are advertisements. I need to calculate the total duration of the advertisements in the text. How do I do it?
<time="08:00:00"><type=ad> some text 1 <time="08:02:24"></type=ad> some text 2 <time="08:10:18"><type=ad> some text 3 <time="08:12:20"></type=ad>
Since each timestamp is preceded by the substring "time=", you can find the number of occurrences with the count method, e.g. num = string.count("time=")
Then you can use a for loop over this range to find the actual strings.
times = []
index = 0
for i in range(num):
    time = string.index("time=", index)  # search for "time=", starting from index
    times.append(string[time+6:time+14])
    index = time + 1
Then you can split each timestamp string into its hour, minute, and second components like so:
hms = [timestr.split(":") for timestr in times]
Putting it all together, the following full code should work:
string = '<time="08:00:00"><type=ad> some text 1 <time="08:02:24"></type=ad> some text 2 <time="08:10:18"><type=ad> some text 3 <time="08:12:20"></type=ad>'
num = string.count("time=")
times = []
index = 0
for i in range(num):
    time = string.index("time=", index)  # search for "time=", starting from index
    times.append(string[time+6:time+14])
    index = time + 1
print(times)
hms = [timestr.split(":") for timestr in times]
print(hms)
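From there you can convert each timestamp to seconds and total the ad durations; a minimal sketch, assuming (as in the sample) the timestamps always alternate ad start, ad end:
seconds = [int(h) * 3600 + int(m) * 60 + int(s) for h, m, s in hms]
# pair the timestamps up as (start, end) and sum the differences
total = sum(end - start for start, end in zip(seconds[::2], seconds[1::2]))
print(total)  # 266 seconds for the sample string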
Leave a comment if you have any questions

Count the number of times a word is repeated in a text file

I need to write a program that prompts for the name of a text file and prints the words with the maximum and minimum frequency, along with their frequency (separated by a space).
This is my text
I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham
Code:
file = open(fname, 'r')
dict1 = []
for line in file:
    line = line.lower()
    x = line.split(' ')
    if x in dict1:
        dict1[x] += 1
    else:
        dict1[x] = 1
Then I wanted to iterate over the keys and values to find the max and min frequency, but at that point my console says:
TypeError: list indices must be integers or slices, not list
I don't know what that means either.
For this problem the expected result is:
Max frequency: i 5
Min frequency: you 1
You are using a list instead of a dictionary to store the word frequencies. A list can't store key-value pairs like this; you need a dictionary instead. Here is how you could modify your code to use a dictionary to store the word frequencies:
file = open(fname, 'r')
word_frequencies = {}  # use a dictionary to store the word frequencies
for line in file:
    line = line.lower()
    words = line.split()  # split on any whitespace, which also drops the trailing newline
    for word in words:
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1
Then, to iterate over the entries and find the max and min frequency:
# iterate over the keys and values in the word_frequencies dictionary
# and find the words with the max and min frequency
max_word = None
min_word = None
max_frequency = 0
min_frequency = float('inf')
for word, frequency in word_frequencies.items():
    if frequency > max_frequency:
        max_word = word
        max_frequency = frequency
    if frequency < min_frequency:
        min_word = word
        min_frequency = frequency
Finally, print the results:
print("Max frequency:", max_word, max_frequency)
print("Min frequency:", min_word, min_frequency)

Comparing keywords from one file to another file containing different tweets

I am currently doing a project for my Python course, which is sentiment analysis of tweets. We have only just finished reading/writing files and things like split() and strip() in class, so I am still a noob at programming.
The project involves two files, the keywords.txt file and tweets.txt file, sample of the files are:
sample of tweets.txt:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC
[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.
where the numbers in the brackets are coordinates and the number after them can be ignored; the message/tweet comes after.
sample of keywords.txt:
alone,1
amazed,10
excited,10
love,10
where numbers represent the "sentimental value" of that keyword
What I am supposed to do is read both files in Python, separate the words in each tweet, and check whether any of the keywords appear in the tweet; if they do, add together their sentiment values. Finally, print the total sentiment value of each tweet, ignoring tweets that do not contain any keywords.
So for example, in the first tweet in the sample two keywords appear (excited and love), so the total sentiment value would be 20.
However, my code prints the sentiment values separately as 10, 10, rather than printing the total. I also have no idea how to make the keyword-checking loop iterate over every tweet.
My code so far:
tweets = open("tweets.txt", "r")
keywords = open("keywords.txt", "r")

def tweetDataExtract(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(" ", 5)
        return parts

def keywordsDataExtract(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(",", 1)
        return parts

tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    # gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    print(lat, long, messageWords)
    keywordsData = keywordsDataExtract(keywords)
    while len(keywordsData) == 2:
        words = keywordsData[0]
        happiness = int(keywordsData[1])
        keywordsData = keywordsDataExtract(keywords)
        count = 0
        sentiment = 0
        if words in messageWords:
            sentiment += happiness
            count += 1
        print(lat, long, count, sentiment)

tweets.close()
keywords.close()
How can I fix the code?
PS I didn't know which part of the code would be essential to post, so I just posted the whole thing so far.
The problem is that you initialised the variables count and sentiment inside the inner keyword loop, so they were reset (and printed) once per keyword instead of once per tweet. A second problem is that the keywords file was exhausted after the first tweet, leaving later tweets with nothing to compare against; the corrected code reads keywords.txt into a list once, up front.
Corrected code:
tweets = open("tweets.txt", "r")
keywords = open("keywords.txt", "r")

def tweetDataExtract(infile):
    line = infile.readline()
    if line == "\n":
        # skip blank lines by reading the next one
        return tweetDataExtract(infile)
    else:
        parts = line.split(" ", 5)
        return parts

keywordsData = [line.split(',') for line in keywords]  # read all keywords once, up front
tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    # gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    count = 0  # initialised once per tweet, not once per keyword
    sentiment = 0
    for i in range(len(keywordsData)):
        words = keywordsData[i][0]
        happiness = int(keywordsData[i][1].strip())
        if words in messageWords:
            sentiment += happiness
            count += 1
    print(lat, long, count, sentiment)

tweets.close()
keywords.close()
See this new code (shorter and more Pythonic):
import string

dic = {}
tweets = []
with open("tweets.txt", 'r') as f:
    tweets = [line.strip() for line in f if line.strip() != '']
with open("keywords.txt", 'r') as f:
    dic = {line.strip().split(',')[0]: line.strip().split(',')[1] for line in f if line.strip() != ''}
for t in tweets:
    t = t.split(" ", 5)
    lat = float(t[0].strip("[,"))
    lon = float(t[1].rstrip("]"))
    sentiment = 0
    for word in t[5].translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in dic:
            sentiment += int(dic[word])
    print(lat, lon, sentiment)
Output:
41.29866963 -81.91532933 20
33.70290033 -117.95095704 0
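If the translate call looks opaque: str.maketrans("", "", string.punctuation) builds a translation table that deletes every punctuation character, so each tweet is stripped of punctuation before the word lookup. For instance:
import string
print("love!!!".translate(str.maketrans("", "", string.punctuation)))  # prints: love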

dictionaries feature extraction Python

I'm doing a text categorization experiment. For the feature extraction phase I'm trying to create a feature dictionary per document. For now I have two features: type-token ratio, and n-grams of the relative frequencies of function words. When I print my instances, only the type-token ratio feature is in the dictionary. This seems to be caused by a malfunctioning get_pos(): it returns empty lists.
This is my code:
instances = []
labels = []
directory = "\\Users\OneDrive\Data"
for dname, dirs, files in os.walk(directory):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath, 'r') as f:
            text = csv.reader(f, delimiter='\t')
            vector = {}
            # TTR
            lemmas = get_lemmas(text)
            unique_lem = set(lemmas)
            TTR = str(len(unique_lem) / len(lemmas))
            name = fname[:5]
            vector['TTR' + '+' + name] = TTR
            # function word ngrams
            pos = get_pos(text)
            fw = []
            regex = re.compile(
                r'(LID)|(VNW)|(ADJ)|(TW)|(VZ)|(VG)|(BW)')
            for tag in pos:
                if regex.search(tag):
                    fw.append(tag)
            for n in [1, 2, 3]:
                grams = ngrams(fw, n)
                fdist = FreqDist(grams)
                total = sum(c for g, c in fdist.items())
                for gram, count in fdist.items():
                    vector['fw' + str(n) + '+' + ' ' + name.join(gram)] = count / total
            instances.append(vector)
            labels.append(fname[:1])
print(instances)
The input files are Dutch, tab-separated. This is the code for the get_pos function, which I call from another script:
def get_pos(text):
    row4 = []
    pos = []
    for row in text:
        if not row:
            continue
        else:
            row4.append(row[4])
    pos = [x.split('(')[0] for x in row4]  # keep only the tag before the opening bracket
    return pos
Can you help me find what's wrong with the get_pos function?
When you call get_lemmas(text), all contents of the file are consumed, so get_pos(text) has nothing left to iterate over. If you want to go through a file's content multiple times, you need to either f.seek(0) between the calls, or read the rows into a list in the beginning and iterate over the list when needed.
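A minimal sketch of the second option, reusing the names from the question (get_lemmas and get_pos are assumed to accept any iterable of rows):
with open(fpath, 'r') as f:
    rows = list(csv.reader(f, delimiter='\t'))  # materialise the rows once

lemmas = get_lemmas(rows)  # both calls now iterate over the same list,
pos = get_pos(rows)        # so neither one sees an exhausted reader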

How to rewrite code

My question is in a comment in the code below.
My program censors words. It works for both one word and many words. I was having trouble making it work for many words: it would print out the sentence with the spaces censored too. I found code that makes it work, but I do not understand it.
sentence = input("Enter a sentence:")
word = input("Enter a word to replace:")
words = word

def censorWord(sentence, word):
    # I would like to rewrite this code in a way I can understand and read clearer.
    return " ".join(["-"*len(item) if item in word else item for item in sentence.split()])

def censorWords(sentence, words):
    words1 = words.split()
    for w in words1:
        if w in sentence:
            return replaceWord(sentence, word)

print(censorWords(sentence, words))
def censorWord(sentence, word):
    result = []  # list to store the new sentence words
    eachword = sentence.split()  # split the sentence into a separate list of words
    for item in eachword:  # iterate over the list word by word
        if item == word:  # if the current item matches the given word, replace it with one - per character
            item = "-" * len(word)
        result.append(item)  # add the word to the result list
    return " ".join(result)  # join all the words in the list with a space in between
You can rewrite:
s = " ".join(["-" * len(item) if item in word else item for item in sentence.split()])
Into:
arr = []
for item in sentence.split():
    if item in word:
        arr.append("-" * len(item))
    else:
        arr.append(item)
s = " ".join(arr)
It basically splits sentence on spaces; then, if the current item is in word, it gets replaced by hyphens, one per character.
You seem to be a bit confused: censorWord() already censors all matching words in the sentence, and censorWords() looks like it is trying to do the same thing but returns in the middle of processing. Just looking at censorWord(): more descriptive variable naming and breaking the one-liner down would probably make it clearer, e.g.:
def redact(word):
    return '-' * len(word)

def censorWord(sentence, censored_words):
    words = sentence.split()
    return " ".join([redact(word) if word in censored_words else word for word in words])
You can always turn this into a for loop, but list comprehensions are a common part of Python and you should get comfortable with them:
def censorWord(sentence, censored_words):
    words = sentence.split()
    clean_sentence = []
    for word in words:
        if word in censored_words:
            clean_sentence.append(redact(word))
        else:
            clean_sentence.append(word)
    return " ".join(clean_sentence)
