I'm a beginner in Python and I have a small question. I have plain text with timing tags (hours:minutes:seconds). 'Some text 1' and 'Some text 3' are advertisements. I need to calculate the total duration of the advertisements in the text. How do I do it?
<time=”08:00:00"><type=ad> some text 1 <time=”08:02:24"></type=ad> some text 2 <time=”08:10:18"><type=ad> some text 3 <time=”08:12:20"></type=ad>
If each timestamp is preceded by the string "time=", then you can find the number of occurrences using the count method, e.g. num = string.count("time="). Then you can use a for loop over this range to find the actual timestamps.
times = []
index = 0
for i in range(num):
    time = string.index("time=", index)  # search for "time=", starting from index
    times.append(string[time+6:time+14])
    index = time + 1
Then you can split the strings into hour-minute-seconds integer strings in a list like so:
hms = [timestr.split(":") for timestr in times]
Putting it all together, the following full code should work:
string = '<time=”08:00:00"><type=ad> some text 1 <time=”08:02:24"></type=ad> some text 2 <time=”08:10:18"><type=ad> some text 3 <time=”08:12:20"></type=ad>'
num = string.count("time=")
times = []
index = 0
for i in range(num):
    time = string.index("time=", index)  # search for "time=", starting from index
    times.append(string[time+6:time+14])
    index = time + 1
print(times)
hms = [timestr.split(":") for timestr in times]
print(hms)
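The question actually asks for the total ad duration, which the code above stops short of. Here is a minimal sketch of that last step, assuming the extracted timestamps come in start/end pairs (each ad runs from one timestamp to the next); datetime.strptime makes the subtraction easy:

```python
from datetime import datetime

# timestamps as extracted above: each ad spans a (start, end) pair
times = ["08:00:00", "08:02:24", "08:10:18", "08:12:20"]

# pair up even-indexed starts with odd-indexed ends and sum the differences
total = sum(
    (datetime.strptime(end, "%H:%M:%S") - datetime.strptime(start, "%H:%M:%S")).total_seconds()
    for start, end in zip(times[0::2], times[1::2])
)
print("Total ad duration:", total, "seconds")  # 266.0
```

For this sample, the two ads last 144 and 122 seconds, for a total of 266 seconds.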
Leave a comment if you have any questions
I need to write a program that prompts for the name of a text file and prints the words with the maximum and minimum frequency, along with their frequency (separated by a space).
This is my text
I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham
Code:
file = open(fname,'r')
dict1 = []
for line in file:
    line = line.lower()
    x = line.split(' ')
    if x in dict1:
        dict1[x] += 1
    else:
        dict1[x] = 1
Then I wanted to iterate over the keys and values to find the max and min frequency, but before I got that far my console said
TypeError: list indices must be integers or slices, not list
I don't know what that means either.
For this problem the expected result is:
Max frequency: i 5
Min frequency: you 1
You are using a list instead of a dictionary to store the word frequencies. A list can't store key-value pairs like this; you need a dictionary. Here is how you could modify your code to use one:
file = open(fname, 'r')
word_frequencies = {}  # use a dictionary to store the word frequencies
for line in file:
    line = line.lower()
    words = line.split()  # split() with no argument also drops the trailing newline
    for word in words:
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1
file.close()
Then iterate over the keys and values to find the min and max frequency:
# iterate over the keys and values in the word_frequencies dictionary
# and find the words with the max and min frequency
max_word = None
min_word = None
max_frequency = 0
min_frequency = float('inf')
for word, frequency in word_frequencies.items():
    if frequency > max_frequency:
        max_word = word
        max_frequency = frequency
    if frequency < min_frequency:
        min_word = word
        min_frequency = frequency
Print the results
print("Max frequency:", max_word, max_frequency)
print("Min frequency:", min_word, min_frequency)
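For comparison, collections.Counter does the counting in one step. This sketch inlines the sample text instead of reading a file, so it is self-contained:

```python
from collections import Counter

# sample text from the question, inlined instead of read from a file
text = """I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham"""

counts = Counter(text.lower().split())
max_word, max_n = max(counts.items(), key=lambda kv: kv[1])
min_word, min_n = min(counts.items(), key=lambda kv: kv[1])
print("Max frequency:", max_word, max_n)  # Max frequency: i 5
print("Min frequency:", min_word, min_n)  # Min frequency: you 1
```

Note that with ties for the minimum, min returns the first minimal entry in insertion order, which here happens to match the expected output.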
text1 = 'We are not what we should be We are not what we need to be But at least we are not what we used to be -- Football Coach'
I have this text1 and here is my code for it:
def word_dict(all_words):
    word_dict = {}
    # YOUR CODE HERE
    text = text1.split()
    for line in text:
        words = line.split()
        for word in words:
            word = word.lower()
            if not word in word_dict:
                word_dict[word] = 1
            else:
                word_dict[word] = word_dict[word] + 1
    return word_dict
I am counting each and every word in text1, but when I display the results I want to display them from the largest number of counts to the smallest.
You can use sorted with a key function to order the words by their counts, from largest to smallest:

counts = word_dict(text1)
result = sorted(counts.items(), key=lambda item: item[1], reverse=True)
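Alternatively, collections.Counter.most_common already returns the words sorted from the largest count to the smallest. A minimal sketch using the question's text1:

```python
from collections import Counter

text1 = ('We are not what we should be We are not what we need to be '
         'But at least we are not what we used to be -- Football Coach')

# lowercase while counting so "We" and "we" are the same word
counts = Counter(word.lower() for word in text1.split())
for word, n in counts.most_common():
    print(word, n)  # 'we' comes first with 6 occurrences
```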
I am currently doing a project for my Python course, which is sentiment analysis for tweets. We have only just finished reading/writing files and things like split() and strip() in class, so I am still a noob at programming.
The project involves two files, the keywords.txt file and tweets.txt file, sample of the files are:
sample of tweets.txt:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC
[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today
is going to be the greatest day of my life. Hired to take pictures at
my best friend's gparents 50th anniversary. 60 old people. Woo.
where the numbers in the brackets are coordinates, the number after them can be ignored, and the message/tweet comes after.
sample of keywords.txt:
alone,1
amazed,10
excited,10
love,10
where numbers represent the "sentimental value" of that keyword
What I am supposed to do is read both files in Python, separate the words in each tweet, and check whether any of the keywords appear in each tweet; if they do, add their sentiment values together. Finally, print the total sentiment value of each tweet, ignoring tweets that do not contain any keywords.
So for example, in the first tweet in the sample two keywords appear (excited and love), so the total sentiment value would be 20.
However, my code prints out the sentiment values separately as 10, 10, rather than printing the total. I also have no idea how to make the keyword-checking part iterate over every tweet.
My code so far:
tweets = open("tweets.txt", "r")
keywords = open("keywords.txt", "r")

def tweetDataExtract(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(" ", 5)
        return parts

def keywordsDataExtract(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.split(",", 1)
        return parts

tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    # gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    print(lat, long, messageWords)
    keywordsData = keywordsDataExtract(keywords)
    while len(keywordsData) == 2:
        words = keywordsData[0]
        happiness = int(keywordsData[1])
        keywordsData = keywordsDataExtract(keywords)
        count = 0
        sentiment = 0
        if words in messageWords:
            sentiment += happiness
            count += 1
    print(lat, long, count, sentiment)
tweets.close()
keywords.close()
How can I fix the code?
PS I didn't know which part of the code would be essential to post, so I just posted the whole thing so far.
The problem was that you had initialised the variables count and sentiment inside the while loop itself, so they were reset on every pass.
Corrected code :
tweets = open("tweets.txt", "r")
keywords = open("keywords.txt", "r")

def tweetDataExtract(infile):
    line = infile.readline()
    if line == "\n":
        # skip blank lines
        return tweetDataExtract(infile)
    else:
        parts = line.split(" ", 5)
        return parts

keywordsData = [line.split(',') for line in keywords]
tweetData = tweetDataExtract(tweets)
while len(tweetData) == 6:
    lat = float(tweetData[0].strip("[,"))
    long = float(tweetData[1].rstrip("]"))
    message = tweetData[5].split(" ")
    messageWords = []
    # gets rid of all the punctuation in the strip() brackets
    for element in message:
        element = element.strip("!#.,?[]{}#-_-:)('=/%;&*+|<>`~\n")
        messageWords.append(element.lower())
    tweetData = tweetDataExtract(tweets)
    count = 0
    sentiment = 0
    for i in range(len(keywordsData)):
        words = keywordsData[i][0]
        happiness = int(keywordsData[i][1].strip())
        if words in messageWords:
            sentiment += happiness
            count += 1
    print(lat, long, count, sentiment)
tweets.close()
keywords.close()
See this new code (shorter and pythonic):
import string

with open("tweets.txt", 'r') as f:
    tweets = [line.strip() for line in f if line.strip() != '']
with open("keywords.txt", 'r') as f:
    dic = {line.strip().split(',')[0]: line.strip().split(',')[1] for line in f if line.strip() != ''}

for t in tweets:
    t = t.split(" ", 5)
    lat = float(t[0].strip("[,"))
    lon = float(t[1].rstrip("]"))
    sentiment = 0
    for word in t[5].translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in dic:
            sentiment += int(dic[word])
    print(lat, lon, sentiment)
Output:
41.29866963 -81.91532933 20
33.70290033 -117.95095704 0
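The core per-tweet step can also be isolated into a self-contained sketch, using the keyword values and the first tweet from the question's samples:

```python
import string

# keyword -> sentiment value, from the question's keywords.txt sample
keywords = {"alone": 1, "amazed": 10, "excited": 10, "love": 10}

tweet = ("Work needs to fly by ... I'm so excited to see Spy Kids 4 "
         "with then love of my life ... ARREIC")

# strip punctuation, lowercase, then sum the values of any matching keywords
words = tweet.translate(str.maketrans("", "", string.punctuation)).lower().split()
sentiment = sum(keywords[w] for w in words if w in keywords)
print(sentiment)  # 20: 'excited' (10) + 'love' (10)
```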
I'm trying to generate code to return the number of substrings within an input that are in sequential alphabetical order.
i.e. Input: 'abccbaabccba'
Output: 2
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def cake(x):
    for i in range(len(x)):
        for j in range(len(x)+1):
            s = x[i:j+1]
            l = 0
            if s in alphabet:
                l += 1
    return l

print(cake('abccbaabccba'))
So far my code will only return 1. Based on tests I've done on it, it seems it just returns a 1 if there are letters in the input. Does anyone see where I'm going wrong?
You are getting the output 1 every time because your code resets the count to l = 0 on every pass through the loop.
If you fix this, you will get the answer 96, because you are including a lot of redundant checks on empty strings ('' in alphabet returns True).
If you fix that, you will get 17, because your test string contains substrings of length 1 and 2, as well as 3+, that are also substrings of the alphabet. So, your code needs to take into account the minimum substring length you would like to consider—which I assume is 3:
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def cake(x, minLength=3):
    l = 0
    for i in range(len(x)):
        # specify both the start and end of the inner range; the end must be
        # len(x)+1 so that a run ending at the last character is counted too
        for j in range(i+minLength, len(x)+1):
            s = x[i:j]
            if s in alphabet:
                print(repr(s))
                l += 1
    return l

print(cake('abccbaabccba'))
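One edge case worth noting: since the slice end index is exclusive, the inner loop bound needs to reach len(x) + 1 or a run that ends at the string's last character is silently missed. A self-contained version (with a hypothetical name to avoid clashing with the code above) for checking:

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def count_alpha_runs(x, min_length=3):
    # count substrings of length >= min_length that appear inside the alphabet
    count = 0
    for i in range(len(x)):
        for j in range(i + min_length, len(x) + 1):  # j is an exclusive end index
            if x[i:j] in alphabet:
                count += 1
    return count

print(count_alpha_runs('abccbaabccba'))  # 2
print(count_alpha_runs('xxabc'))         # 1: 'abc' ends at the last character
```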
I have to count how often a certain string is contained in a cell array. The problem is that the code is way too slow: it takes almost 1 second to do this.
uniqueWordsSize = 6; % just a sample number
wordsCounter = zeros(uniqueWordsSize, 1);
uniqueWords = unique(words); % words is a cell array
for i = 1:uniqueWordsSize
    wordsCounter(i) = sum(strcmp(uniqueWords(i), words));
end
What I'm currently doing is to compare every word in uniqueWords with the cell-array words and use sum in order to calculate the sum of the array which gets returned by strcmp.
I hope someone can help me to optimize that.... 1 second for 6 words is just too much.
EDIT: ismember is even slower.
You can drop the loop completely by using the third output of unique together with hist:
words = {'a','b','c','a','a','c'}
[uniqueWords,~,wordOccurrenceIdx]=unique(words)
nUniqueWords = length(uniqueWords);
counts = hist(wordOccurrenceIdx,1:nUniqueWords)
uniqueWords =
'a' 'b' 'c'
wordOccurrenceIdx =
1 2 3 1 1 3
counts =
3 1 2
A tricky way without explicit for loops:
clc
close all
clear all
Paragraph = lower(fileread('Temp1.txt'));
AlphabetFlag = Paragraph >= 97 & Paragraph <= 122; % flag alphabetic characters
DelimFlag = find(AlphabetFlag == 0); % treat non-alphabetic characters as delimiters
WordLength = [DelimFlag(1), diff(DelimFlag)];
Paragraph(DelimFlag) = []; % remove the delimiters
Words = mat2cell(Paragraph, 1, WordLength-1); % cut the paragraph into words
[SortWords, Ia, Ic] = unique(Words); % find unique words and their subscripts
Bincounts = histc(Ic, 1:size(Ia, 1)); % count their occurrences
[SortBincounts, IndBincounts] = sort(Bincounts, 'descend'); % sort by frequency
FreqWords = SortWords(IndBincounts); % order words by their frequency
FreqWords(1) = []; SortBincounts(1) = []; % drop the empty-string entry left by consecutive delimiters
Freq = SortBincounts/sum(SortBincounts)*100; % frequency percentage
%% plot
NMostCommon = 20;
disp(Freq(1:NMostCommon))
pie([Freq(1:NMostCommon); 100-sum(Freq(1:NMostCommon))], [FreqWords(1:NMostCommon), {'other words'}]);