String to dictionary word count and display - python-3.x

I have a homework question which asks:
Write a function print_word_counts(filename) that takes the name of a
file as a parameter and prints an alphabetically ordered list of all
words in the document converted to lower case plus their occurrence
counts (this is how many times each word appears in the file).
I am able to get an unordered set of each word with its occurrence count; however, when I sort it and put each word on a new line, the count disappears.
import re
def print_word_counts(filename):
    input_file = open(filename, 'r')
    source_string = input_file.read().lower()
    input_file.close()
    words = re.findall('[a-zA-Z]+', source_string)
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    sorted_count = sorted(counts)
    print("\n".join(sorted_count))
When I run this code I get:
a
aborigines
absence
absolutely
accept
after
and so on.
What I need is:
a: 4
aborigines: 1
absence: 1
absolutely: 1
accept: 1
after: 1
I'm not sure how to sort it and keep the values.

It's a homework question, so I can't give you the full answer, but here's enough to get you started. Your mistake is in this line:
sorted_count = sorted(counts)
First, a dictionary can't be sorted in place. Second, what this line does is take the keys of the dictionary, sort them, and return them as a list, so the counts are left behind.
You can just print the value of counts, or, if you really need them in sorted order, consider changing the dictionary items into a list, then sorting them.
lst = list(counts.items())
# sort and return lst
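One way the hint above could play out, as a minimal sketch. The counts dictionary here is a small hand-made stand-in rather than the output of the real file, and format_counts is a hypothetical helper name, not something from the assignment.

```python
def format_counts(counts):
    """Return one 'word: count' line per entry, ordered alphabetically by word."""
    # items() yields (word, count) pairs; sorting the pairs orders them by word,
    # and each count travels along with its word
    pairs = sorted(counts.items())
    return [f"{word}: {count}" for word, count in pairs]

# Hand-made example data, not taken from a real file
counts = {"absence": 1, "a": 4, "aborigines": 1}
print("\n".join(format_counts(counts)))
```

Because the pairs are sorted as a whole, nothing has to be looked up again after sorting.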

Related

Is there any ways to make this more efficient?

I have 24 more attempts to submit this task. I have spent hours on it and my brain does not work anymore. I am a beginner with Python; can you please help me figure out what is wrong? I would love to see the correct code if possible.
Here is the task itself and the code I wrote below.
Note that you can have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
You are given a content of CSV-file with information about set of trades. It contains the following columns:
TIME - Timestamp of a trade in format Hour:Minute:Second.Millisecond
PRICE - Price of one share
SIZE - Count of shares executed in this trade
EXCHANGE - The exchange that executed this trade
For each exchange find the one minute-window during which the largest number of trades took place on this exchange.
Note that:
You need to send source code of your program.
You have only 25 attempts to submit a solution for this task.
You have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
Input format
Input contains several lines. You can read it from standard input or the file “trades.csv”.
Each line contains information about one trade: TIME, PRICE, SIZE and EXCHANGE. Numbers are separated by comma.
Lines are listed in ascending order of timestamps. Several lines can contain the same timestamp.
Size of input file does not exceed 5 MB.
See the example below to understand the exact input format.
Output format
If input contains information about k exchanges, print k lines to standard output.
Each line should contain the only number — maximum number of trades during one minute-window.
You should print answers for exchanges in lexicographical order of their names.
Sample
Input:
09:30:01.034,36.99,100,V
09:30:55.000,37.08,205,V
09:30:55.554,36.90,54,V
09:30:55.556,36.91,99,D
09:31:01.033,36.94,100,D
09:31:01.034,36.95,900,V
Output:
2
3
Notes
In the example four trades were executed on exchange “V” and two trades were executed on exchange “D”. Not all of the “V”-trades fit in one minute-window, so the answer for “V” is three.
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60.+float(item[0][3:5])+float(item[0][6:8])/60.+float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1]-dex[item][0] <1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1]-dex[item][t] <1:
                    ccount += 1
                else: break
            if ccount>count:
                count=ccount
            ccount=1
    print(count)
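First of all, it is not necessary to use the datetime and csv modules for such a simple case (as in Ed-Ward's example).
Each timestamp has a fixed layout, so it can be sliced and converted to an integer number of milliseconds with int(), which is simpler than the float arithmetic in your example. Simply deleting the colons and the dot and calling int() on the rest looks tempting, but it breaks when a window crosses an hour boundary: 09:59:30 plus one minute would become 09:60:30.
CSV features like dialects and special formatting are not used here, so I suggest a simple split(",").
Now about efficiency, which here means time complexity.
The more times you walk through your list of timestamps from beginning to end, the slower the algorithm gets.
So the goal is to minimize the number of passes: ideally make a single pass over all rows, and in particular avoid nested loops that rescan a collection.
For such a task it is better to use a deque than a tuple or list, because you can remove the first element with popleft() and add a last element with append(), both in O(1).
For every row, append its time to the end of the queue of its exchange. If the difference between the new element and the first element has reached one minute, remove the first element with popleft(). Since at most one element is removed per appended element, a queue never shrinks; it only grows when one more trade fits into a one-minute window. After the whole file is processed, the length of each queue is therefore the maximum number of trades in a one-minute window.
Example with linear time complexity O(n):
from collections import deque
WINDOW_MS = 60000  # one minute in milliseconds
def to_ms(stamp):
    # "HH:MM:SS.mmm" -> milliseconds since midnight (assumes three millisecond digits)
    return (int(stamp[0:2]) * 3600 + int(stamp[3:5]) * 60 + int(stamp[6:8])) * 1000 + int(stamp[9:12])
ex_list = {}
# utf-8-sig drops a leading byte order mark if the file has one
with open("trades.csv", encoding="utf-8-sig") as f:
    for line in f:
        fields = line.strip().split(",")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        curr_tm = to_ms(fields[0])
        curr_ex = fields[3]
        if curr_ex not in ex_list:
            ex_list[curr_ex] = deque()
        ex_list[curr_ex].append(curr_tm)
        # remove at most one element per row, so the queue never shrinks
        if curr_tm >= ex_list[curr_ex][0] + WINDOW_MS:
            ex_list[curr_ex].popleft()
print("\n".join(str(len(ex_list[k])) for k in sorted(ex_list)))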
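To try the queue idea without a trades.csv on disk, here is a hedged variant wrapped in a function that takes the lines directly. The names to_ms and max_trades_per_minute are my own, and converting each timestamp to milliseconds since midnight is an assumption I chose so that the one-minute arithmetic stays valid across hour boundaries; it also assumes timestamps always carry three millisecond digits.

```python
from collections import deque

WINDOW_MS = 60 * 1000  # one minute, in milliseconds

def to_ms(stamp):
    """Convert 'HH:MM:SS.mmm' to milliseconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    whole, millis = seconds.split(".")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + int(millis)

def max_trades_per_minute(lines):
    """Return {exchange: largest number of trades inside any one-minute window}."""
    queues = {}
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        queue = queues.setdefault(fields[3], deque())
        queue.append(to_ms(fields[0]))
        # Remove at most one element per trade: the queue length never
        # decreases, so at the end it equals the largest window seen.
        if queue[-1] - queue[0] >= WINDOW_MS:
            queue.popleft()
    return {name: len(queue) for name, queue in queues.items()}

sample = [
    "09:30:01.034,36.99,100,V",
    "09:30:55.000,37.08,205,V",
    "09:30:55.554,36.90,54,V",
    "09:30:55.556,36.91,99,D",
    "09:31:01.033,36.94,100,D",
    "09:31:01.034,36.95,900,V",
]
result = max_trades_per_minute(sample)
for name in sorted(result):
    print(result[name])
```

On the sample from the task this prints 2 and then 3, matching the expected output.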
This code should work:
import csv
import datetime
diff = datetime.timedelta(minutes=1)
def date_calc(start, dates):
    for i, date in enumerate(dates):
        if date >= start + diff:
            return i
    return i + 1
exchanges = {}
with open("trades.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        this_exchange = row[3]
        if this_exchange not in exchanges:
            exchanges[this_exchange] = []
        time = datetime.datetime.strptime(row[0], "%H:%M:%S.%f")
        exchanges[this_exchange].append(time)
ex_max = {}
for name, dates in exchanges.items():
    ex_max[name] = 0
    for i, d in enumerate(dates):
        x = date_calc(d, dates[i:])
        if x > ex_max[name]:
            ex_max[name] = x
print('\n'.join([str(ex_max[k]) for k in sorted(ex_max.keys())]))
Output:
2
3
( obviously please check it for yourself before uploading it :) )
I think the issue with your current code is that you don't put the output in lexicographical order of their names...
If you want to use your current code, then here is a (hopefully) fixed version:
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
counts = []
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60.+float(item[0][3:5])+float(item[0][6:8])/60.+float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1]-dex[item][0] <1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1]-dex[item][t] <1:
                    ccount += 1
                else: break
            if ccount>count:
                count=ccount
            ccount=1
    counts.append((item, count))
counts.sort(key=lambda x: x[0])
print('\n'.join([str(x[1]) for x in counts]))
Output:
2
3
I do think you can make your life easier in the future by using Python's standard library, though :)

How can I create a dictionary for a large amount of text and list the most frequent word?

I am new to coding and I am trying to create a dictionary from a large body of text and would also like the most frequent word to be shown?
For example, if I had a block of text such as:
text = '''George Gordon Noel Byron was born, with a clubbed right foot, in London on January 22, 1788. He was the son of Catherine Gordon of Gight, an impoverished Scots heiress, and Captain John (“Mad Jack”) Byron, a fortune-hunting widower with a daughter, Augusta. The profligate captain squandered his wife’s inheritance, was absent for the birth of his only son, and eventually decamped for France as an exile from English creditors, where he died in 1791 at 36.'''
I know the steps I would like the code to take. I want words that are the same but capitalised to be counted together so Hi and hi would count as Hi = 2.
I am trying to get the code to loop through the text and create a dictionary showing how many times each word appears. My final goal is to then have the code state which word appears most frequently.
I don't know how to approach such a large amount of text, the examples I have seen are for a much smaller amount of words.
I have tried to remove white space and also create a loop but I am stuck and unsure if I am going the right way about coding this problem.
a.replace(" ", "")
# this gave <built-in method replace of str object at 0x000001A49AD8DAE0>, I have no idea what this means!
print(a.replace) # this is what I tried to write to remove white spaces
I am unsure of how to create the dictionary.
To count the word frequency would I do something like:
frequency = {}
for value in my_dict.values() :
    if value in frequency :
        frequency[value] = frequency[value] + 1
    else :
        frequency[value] = 1
What I was expecting to get was a dictionary that lists each word shown with a numerical value showing how often it appears in the text.
Then I wanted to have the code show the word that occurs the most.
This may be too simple for your requirements, but you could do this to create a dictionary of each word and its number of repetitions in the text.
text = "..." # text here.
frequency = {}
for word in text.split(" "):
    if word not in frequency.keys():
        frequency[word] = 1
    else:
        frequency[word] += 1
print(frequency)
This only splits the text up at each ' ' and counts the number of each occurrence.
If you want to get only the words, you may have to remove the ',' and other characters which you do not wish to have in your dictionary.
To remove characters such as ',', do:
text = text.replace(",", "")
Hope this helps and happy coding.
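Building on the loop above, here is a rough sketch that also folds case (so Hi and hi count together) and strips punctuation from the edges of each word. The function name word_frequencies and the sample sentence are made up for illustration. Note that string.punctuation only covers ASCII punctuation, so curly quotes like the ones in your Byron text would need extra handling.

```python
import string

def word_frequencies(text):
    """Count words case-insensitively, ignoring punctuation around each word."""
    frequency = {}
    # split() with no argument splits on any run of whitespace, including newlines
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the token was punctuation only
        frequency[word] = frequency.get(word, 0) + 1
    return frequency

# Made-up sample text
sample = "Hi there, hi again. Hi!"
freq = word_frequencies(sample)
print(freq)
# max() with key=freq.get picks the word with the highest count
print(max(freq, key=freq.get))
```

The amount of text does not change the approach; the same loop works for a sentence or a whole book.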
First, to remove all non-alphabet characters, aside from ', we can use a regex.
After that, we go through the list of words and count them in a dictionary.
import re
d = {}
# lower-case the text so "Hi" and "hi" are counted together, then turn it into a list
text = text.lower().split(" ")
# keep only letters and apostrophes in each word, then put the word back together
text = ["".join(re.findall("[a-z']", word)) for word in text]
for word in text:
    if word in d:
        d[word] += 1
    else:
        d[word] = 1
# tokens with no letters at all (such as "22," or "1788.") collapse to "",
# so remove that key if it is present
d.pop("", None)
You can use regex and Counter from collections :
import re
from collections import Counter
text = "This cat is not a cat, even if it looks like a cat"
# Extract words with regex, ignoring symbols and space
words = re.compile(r"\b\w+\b").findall(text.lower())
count = Counter(words)
# Counter({'cat': 3, 'a': 2, 'this': 1, 'is': 1, 'not': 1, 'even': 1, 'if': 1, 'it': 1, 'looks': 1, 'like': 1})
# To get the most frequent
most_frequent = max(count, key=lambda k: count[k])
# 'cat'
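If you also want the count of the top word, or the top few words, Counter has a most_common() method that returns (word, count) pairs ordered from most to least frequent. A small sketch using the same example sentence:

```python
import re
from collections import Counter

text = "This cat is not a cat, even if it looks like a cat"
count = Counter(re.findall(r"\b\w+\b", text.lower()))

# The two most frequent words with their counts
top_two = count.most_common(2)
print(top_two)

# Unpack the single most frequent word and how often it occurs
word, times = top_two[0]
print(word, times)
```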

Comparing user input list with dictionary and printing out corresponding value

Starting out by saying this is for school and I'm still learning so I'm not looking for a direct solution.
What I want to do is take an input from a user (one word or more).
I then make it in to a list.
I have my dictionary and the code that I'm posting is printing out the values correctly.
My question is how do I compare the characters in my list to the keys in the dictionary and then print only those values that correspond to the keys?
I have also read a ton of different questions regarding dictionaries but it was no help at all.
Example on output;
Word: wow
Output: 96669
user_word = input("Please enter a word: ")
user_listed = list(user_word)
def keypresses():
    my_dict = {'.':1, ',':11, '?':111, '!':1111, ':':11111, 'a':2, 'b':22, 'c':222, 'd':3, 'e':33, 'f':333, 'g':4, 'h':44,
               'i':444, 'j':5, 'k':55, 'l':555, 'm':6, 'n':66, 'o':666, 'p':7, 'q':77, 'r':777, 's':7777, 't':8, 'u':88,
               'v':888, 'w':9, 'x':99, 'y':999, 'z':9999, ' ':0}
    for key, value in my_dict.items():
        print(value)
I am not going to hand you code for the project, but I will definitely point you in the right direction.
As I see it, there are two parts to this: match each character to a key to get its value, and combine the numbers into the output.
For the first part, you can iterate character by character with a simple for loop:
for letter in 'string':
    print(letter)
This prints s, t, r, i, n, g, one letter per line. You can use the same loop to look up the value for each key (each letter).
Then, you can convert the value to a string (so that the numbers are joined rather than added mathematically), with something like:
letter = 'w'
value = my_dict[letter]
value_as_string = str(value)
Finally, combine all of this in a for loop and concatenate the strings to create the desired output.
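If it helps to see the two parts side by side, here is a rough sketch on a deliberately tiny table. KEYS holds only the three entries needed for the example word, and to_keypresses is a hypothetical name; plugging in your full dictionary and the user input is left to you.

```python
# Tiny subset of the table from the question, just enough for the example word
KEYS = {"w": 9, "o": 666, " ": 0}

def to_keypresses(word, table):
    """Look up every character of word in table and join the values as text."""
    pieces = []
    for letter in word:
        # str() keeps 9 and 666 as text, so they are joined rather than added
        pieces.append(str(table[letter]))
    return "".join(pieces)

print(to_keypresses("wow", KEYS))
```

A character missing from the table raises a KeyError here, so you would need to decide how such input should be handled.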

Creating a dictionary of dictionaries from csv file

Hi so I am trying to write a function, classify(csv_file) that creates a default dictionary of dictionaries from a csv file. The first "column" (first item in each row) is the key for each entry in the dictionary and then second "column" (second item in each row) will contain the values.
However, I want to alter the values by calling on two functions (in this order):
trigram_c(string): that creates a default dictionary of trigram counts within the string (which are the values)
normal(tri_counts): that takes the output of trigram_c and normalises the counts (i.e converts the counts for each trigram into a number).
Thus, my final output will be a dictionary of dictionaries:
{value: {trigram1 : normalised_count, trigram2: normalised_count}, value2: {trigram1: normalised_count...}...} and so on
My current code looks like this:
def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((l_rows[0], l_rows[1]) for rows in l_rows)
For example, if the csv file was:
Snippet1, "It was a dark stormy day"
Snippet2, "Hello world!"
Snippet3, "How are you?"
The final output would resemble:
{Snippet1: {'It ': 0.5352, 't w': 0.43232}, Snippet2: {'Hel' : 0.438724,...}...} and so on.
(Of course there would be more than just two trigram counts, and the numbers are just random for the purpose of the example).
Any help would be much appreciated!
First of all, please check your classify function, because it does not run as posted. Here is a corrected version:
import csv
def classify(csv_file):
    # read all rows, closing the file when done
    with open(csv_file) as f:
        l_rows = list(csv.reader(f))
    # first column is the key, second column is the value
    classified = dict((row[0], row[1]) for row in l_rows)
    return classified
It returns a dictionary whose keys come from the first column and whose values are the strings from the second column.
So you should iterate over every dictionary entry and pass its value to the trigram_c function. I didn't understand how you calculate the trigram counts, but if you just count the number of times each trigram appears in the string, you could use the function below. If you want to count differently, you only need to update the code in the for loop.
def trigram_c(string):
    trigram_dict = {}
    start = 0
    end = 3
    for i in range(len(string)-2):
        # you could implement your logic in this loop
        trigram = string[start:end]
        if trigram in trigram_dict.keys():
            trigram_dict[trigram] += 1
        else:
            trigram_dict[trigram] = 1
        start += 1
        end += 1
    return trigram_dict
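To show how the pieces might fit together, here is a hedged end-to-end sketch. Two things are assumptions on my part: normal() divides each count by the total number of trigrams (the question does not say which normalisation is wanted), and classify() takes an iterable of lines instead of a file name so it can be tried on an in-memory sample. It also builds plain dicts rather than defaultdicts, and passes skipinitialspace=True so the space after each comma in the sample file does not break the quoting.

```python
import csv
import io
from collections import Counter

def trigram_c(string):
    """Count every run of three consecutive characters in string."""
    return Counter(string[i:i + 3] for i in range(len(string) - 2))

def normal(tri_counts):
    """Assumed normalisation: each count divided by the total trigram count."""
    total = sum(tri_counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in tri_counts.items()}

def classify(csv_lines):
    """Map column one to the normalised trigram counts of column two."""
    reader = csv.reader(csv_lines, skipinitialspace=True)
    return {row[0]: normal(trigram_c(row[1])) for row in reader if len(row) >= 2}

# In-memory stand-in for the csv file from the question
sample = io.StringIO('Snippet1, "It was a dark stormy day"\nSnippet2, "Hello world!"\n')
result = classify(sample)
print(result["Snippet2"])
```

To read from disk instead, open the file and pass the file object to classify().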

Matching the value of a word in a list with the place value of another list

I am trying to work out how I can compare a list of words against a string and report back the word number from list one when they match. I can easily get the unique list of words from a sentence - just removing duplicates, and with enumerate I can get a value for each word, so Mary had a little lamb becomes 1, Mary, 2, had, 3, a etc. But I cannot work out how to then search the original list again and replace each word with its number value (so it becomes 1 2 3 etc).
Any ideas greatly received!
my_list.index(word)
will return the index of the item word within my_list. You can start digging into the documentation here
Thank you for this info. I can see the logic for this and it should work; however, I get:
line 27, in output=words.index(result) ValueError: ['word1', 'word2'] is not in list
with the following code:
def remove_duplicates(words):
    output = []
    seen = set()
    for value in words:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output
# Remove duplicates from this list.
sentence = input("Enter a sentence ")
words = sentence.split(' ')
result = remove_duplicates(words)
print(result)
Very confusing :(
I have found an answer on here:
positions = [ i+1 for i in range(len(result)) if each == result[i]]
Which works well.
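For anyone landing here later: the ValueError above appears because index() was handed the whole list instead of a single word. As an alternative sketch (the function name word_positions is made up), a dictionary can remember the 1-based position at which each word was first seen, which avoids searching the list again for every word:

```python
def word_positions(sentence):
    """Return the unique words and, for each word of the sentence, its 1-based position among them."""
    words = sentence.split()
    unique = []
    first_seen = {}
    for word in words:
        if word not in first_seen:
            unique.append(word)
            # position in the unique list, counting from 1
            first_seen[word] = len(unique)
    return unique, [first_seen[word] for word in words]

unique, positions = word_positions("Mary had a little lamb Mary had")
print(unique)
print(positions)
```

Here the repeated words at the end of the sentence get the numbers of their first occurrence.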
