Create a dictionary from a file - python-3.x

I am creating a code that allows the user to input a .txt file of their choice. So, for example, if the text read:
"I am you. You ArE I."
I would like my code to create a dictionary that resembles this:
{I: 2, am: 1, you: 2, are: 1}
Having the words in the file appear as the key, and the number of times as the value. Capitalization should be irrelevant, so are = ARE = ArE = arE = etc...
This is my code so far. Any suggestions/help?
file = input("\n Please select a file")
name = open(file, 'r')
dictionary = {}
with name:
    for line in name:
        (key, val) = line.split()
        dictionary[int(key)] = val

Take a look at the examples in this answer:
Python : List of dict, if exists increment a dict value, if not append a new dict
You can use collections.Counter() to trivially do what you want, but if for some reason you can't use that, you can use a defaultdict or even a simple loop to build the dictionary you want.
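For example, the simple-loop alternative with a defaultdict might look like this. This is a minimal sketch that skips punctuation handling, which the full answer below deals with:

```python
from collections import defaultdict

def count_words(text):
    # Count case-insensitive word frequencies without using Counter.
    counts = defaultdict(int)
    for word in text.split():
        counts[word.lower()] += 1
    return dict(counts)

print(count_words("I am you You ArE I"))
# {'i': 2, 'am': 1, 'you': 2, 'are': 1}
```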
Here is code that solves your problem. This will work in Python 3.1 and newer.
from collections import Counter
import string

def filter_punctuation(s):
    return ''.join(ch if ch not in string.punctuation else ' ' for ch in s)

def lower_case_words(f):
    for line in f:
        line = filter_punctuation(line)
        for word in line.split():
            yield word.lower()

def count_key(tup):
    """
    Key function to make a count dictionary sort into descending order
    by count, then case-insensitive word order when counts are the same.
    tup must be a tuple in the form: (word, count)
    """
    word, count = tup
    return (-count, word.lower())

fname = input("\nPlease enter a file name: ")
with open(fname, "rt") as f:
    dictionary = Counter(lower_case_words(f))

print(sorted(dictionary.items(), key=count_key))
From your example I could see that you wanted punctuation stripped away. Since we are going to split the string on white space, I wrote a function that filters punctuation to white space. That way, if you have a string like hello,world this will be split into the words hello and world when we split on white space.
The function lower_case_words() is a generator, and it reads an input file one line at a time and then yields up one word at a time from each line. This neatly puts our input processing into a tidy "black box" and later we can simply call Counter(lower_case_words(f)) and it does the right thing for us.
Of course you don't have to print the dictionary sorted, but I think it looks better this way. I made the sort order put the highest counts first, and where counts are equal, put the words in alphabetical order.
With your suggested input, this is the resulting output:
[('i', 2), ('you', 2), ('am', 1), ('are', 1)]
Because of the sorting it always prints in the above order.


How to iterate through all keys within dic with same values one by one with sequence

I'm working on a text file which contains many words, and I want to get all the words along with their lengths. For example, first I want all the words of length 2, then 3, then 4, and so on up to 15. For example:
word = "this", length = 4
hate: 4
love: 4
that: 4
china: 5
great: 5
and so on, up to length 15.
I was trying to do it with the following code, but I couldn't iterate through all the keys one by one. With this code I can only get the words of length 5, but I want the loop to run from length 2 up to 15 in sequence:
text = open(r"C:\Users\israr\Desktop\counter\Bigdata.txt")
d = dict()
for line in text:
    line = line.strip()
    line = line.lower()
    words = line.split(" ")
    for word in words:
        if word not in d:
            d[word] = len(word)

def getKeysByValue(d, valueToFind):
    listOfKeys = list()
    listOfItems = d.items()
    for item in listOfItems:
        if item[1] == valueToFind:
            listOfKeys.append(item[0])
    return listOfKeys

listOfKeys = getKeysByValue(d, 5)
print("Keys with value equal to 5")
# Iterate over the list of keys
for key in listOfKeys:
    print(key)
What I have done is:
Changed the structure of your dictionary.
In your version of the dictionary, a "word" is the key and its length is the value, like this:
{"hate": 4, "love": 4}
New version:
{4: ["hate", "love"], 5: ["great", "china"]}
Now the keys are integers and the values are lists of words. For instance, if the key is 4, the value is a list of all words from the file with length 4.
After that, the code populates the dictionary with the data read from the file. If a key is not present in the dictionary it is created; otherwise the word is appended to the list against that key.
Finally the keys are sorted and their values printed, so all the words of each length are printed in sequence.
You forgot to close the file in your code. It's good practice to release any resource a program uses when it finishes with it (to avoid resource or memory leaks and similar errors). Most of the time this is just a matter of closing the resource. Closing a file, for instance, releases it so other programs can use it.
# 24-Apr-2020
# 03:11 AM (GMT +05)
# TALHA ASGHAR

# Open the file to read data from
myFile = open(r"books.txt")

# Create an empty dictionary where we will store word lists.
# Format of the data in the dictionary will be:
# {1: [words from file of length 1], 2: [words from file of length 2], ... and so on}
d = dict()

# iterate over all the lines of our file
for line in myFile:
    # get words from the current line
    words = line.lower().strip().split(" ")
    # iterate over each word from the current line
    for word in words:
        # get the length of this word
        length = len(word)
        # if there is no word of this length in the dictionary,
        # create a list against this length;
        # length is the key, the value is the list of words with this length
        if length not in d:
            d[length] = [word]
        # if there is already a word of this length, append the current word to that list
        else:
            d[length].append(word)

for key in sorted(d.keys()):
    print(key, end=":")
    print(d[key])

myFile.close()
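As a side note, the explicit myFile.close() can be avoided entirely with a with block, which closes the file automatically even if an exception occurs inside it. A minimal self-contained sketch (the file name and contents here are just placeholders):

```python
# Create a small sample file so the example is self-contained.
with open("books.txt", "w") as f:
    f.write("some words here\nmore words\n")

# The with statement closes the file automatically when the block exits,
# even if an exception is raised inside it.
with open("books.txt") as my_file:
    for line in my_file:
        print(line.strip())

# No explicit close() call is needed; the file is already closed here.
print(my_file.closed)  # prints True
```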
The first part of your code is correct: dictionary d maps each unique word to its length.
Now you want all the words ordered by their length, as shown below:
{'this': 4, 'that': 4, 'water': 5, 'china': 5, 'great': 5, ... up to length 15}
To get that ordering you can sort the dictionary items by value:
import operator
sorted_d = sorted(d.items(), key=operator.itemgetter(1))
Note that sorted() returns a list of (word, length) tuples, not a dictionary, so sorted_d will be in this format:
[('this', 4), ('that', 4), ('water', 5), ('china', 5), ('great', 5), ..., ('abcdefghijklmno', 15), ...]
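As another side note, collections.defaultdict removes the need for the explicit "if length not in d" check when building the length-to-words dictionary. A minimal sketch, with a couple of in-memory lines standing in for the file's contents:

```python
from collections import defaultdict

# Hypothetical in-memory lines standing in for the file's contents.
lines = ["I hate that", "I love china"]

d = defaultdict(list)
for line in lines:
    for word in line.lower().strip().split(" "):
        # No membership test needed: a missing key starts as an empty list.
        d[len(word)].append(word)

for key in sorted(d):
    print(key, d[key])
```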

Counter() function in for loop looks odd

I have a list of sequences to be found in sequencing data. I run a for loop to find the matching sequences in a dataset, and use Counter() to get the most common ones. But I found that Counter() seems to add in data from previous loop iterations instead of counting each run separately.
ls = ['AGC', 'GCT', 'TAC', 'CGT']
dataset.txt contains a bunch of sequences like "AGTAGCTTT", "AGTTAGC", ...
def xfind(seq):
    ls2 = []
    with open("dataset.txt", 'r') as f:
        for line in f:
            if seq in line:
                ls2.append(line)
    import collections
    from collections import Counter
    cnt = Counter()
    for l in ls2:
        cnt[l] += 1
    print(cnt.most_common()[0])

for l2 in ls:
    xfind(l2)
The results look like:
('AGTAGCTTT', 2)
('AGTAGCTTT', 5)
It should be:
('AGTAGCTTT', 2)
('GCT...', 3)
I'm not sure the code does quite what you intend, and the way you're using Counter isn't really how it's meant to be used.
You start by checking, for each line of the text file, whether the substring occurs in that sequence (line), and if it does you append the line to a list ls2.
Then for every element of that list (which holds whole lines/sequences from the text file) you add 1 to the counter for that key. You do this in a loop, when the whole point of Counter is that you can simply call:
cnt = Counter(ls2)
This all means that you are reporting the most common sequence in the file, which also contains the given subsequence.
Now it is actually a bit hard to say what your exact output should be, without knowing what your dataset.txt looks like.
I would start by tidying up the code a little:
from collections import Counter

subsequences = ['AGC', 'GCT', 'TAC', 'CGT']

def xfind(subseq):
    contains_ss = []
    with open("dataset.txt", 'r') as f:
        for line in f:
            if subseq in line:
                contains_ss.append(line)
    cnt = Counter(contains_ss)
    print(cnt.most_common()[0])

for ss in subsequences:
    xfind(ss)

How do I iterate through a list I'm converting to a matrix from a string?

I'm very new to Python, and I'm currently going through the Matrix problem on Exercism. It's been pretty simple to this point, but I'm having trouble figuring out the best way to iterate through this list. I've been scouring through the web trying to figure it out on my own.
What I have to do is convert the string provided at the bottom to a matrix by splitting at the \n breaks, then convert the entries to integers so that I can use numpy to return either the rows or the columns.
class Matrix(object):
    def __init__(self, matrix_string):
        self.matrix = matrix_string.replace(' ', '').split('\n')
        self.temp = [int(i) for i in self.matrix[0]]

matrix = Matrix("1 2\n3 4")
print(matrix.temp)
I've gotten as far as converting the list, but now I need to return the entire list, not just a particular index as shown above, into a new temp list. That's where I'm stuck.
While you could convert your list as you are doing, there are some problems with the following line:
self.temp = [int(i) for i in self.matrix[0]]
Consider converting "12 1": because iterating over a string yields single characters, your result would be 1, 2, 1 rather than 12, 1.
Instead you should split on newlines and then split each line on spaces (without replacing ' ' with '')
It would be a good exercise to try this by yourself, but here are a hint and a solution in case you get stuck.
Hint: initialize temp as an empty list first, then append lists to it one by one
More in depth hint approach: Initialize temp as an empty list. Split matrix on newlines. Append each line to temp, split on spaces.
Solution:
self.temp = []
self.matrix = matrix_string.split('\n')
for element in self.matrix:
    self.temp.append(element.split(' '))
this can be simplified to one line of list comprehension
you can try that yourself or use the below
self.temp = [element.split(' ') for element in matrix_string.split('\n')]
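To also get the integer conversion the question asks about, the same split-then-split approach extends to a nested comprehension. A sketch under some assumptions: the class name and the column() helper are hypothetical additions, and numpy is left out:

```python
class IntMatrix(object):
    def __init__(self, matrix_string):
        # Split into rows on newlines, then split each row on whitespace
        # and convert every entry to int (so "12 1" stays [12, 1]).
        self.rows = [[int(n) for n in row.split()]
                     for row in matrix_string.split('\n')]

    def column(self, index):
        # Hypothetical helper: return one column as a list.
        return [row[index] for row in self.rows]

matrix = IntMatrix("1 2\n3 4")
print(matrix.rows)       # [[1, 2], [3, 4]]
print(matrix.column(0))  # [1, 3]
```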

How to remove marks from string and turn it into a list

I need to create a function that turns a string into a list of words, without !?., %#$ and without capital letters. The string at the end is just an example; for it the function needs to return ['mr', 'stark', 'i', "don't", 'feel', 'so', 'good'].
Can someone tell me why my code prints None?
def sentence_to_words(s):
    # Write the rest of the code for question 2 below here.
    s_new = []
    s1 = s.split()
    a = ['#',',','!','.','?','$']
    for i in s.split():
        if i in a:
            s2 = s1.remove(i)
            s_new = s_new.append(s2)
            return s_new

print sentence_to_words("Mr. Stark... I don't feel so good")
The best way to debug this is to validate that your assumptions about program state hold on each step. Don't jump ahead until you're sure each line of code does what you expect. Adding a print inside your loop shows exactly what i is on each iteration:
Mr.
Stark...
I
don't
feel
so
good
None of these words are in a = ['#',',','!','.','?','$'], so the conditional block inside your loop never runs. After the loop is exhausted, your program returns None which Python functions return when no return value is specified.
Furthermore, your conditional block operations aren't working as you expect; check return values and avoid making assignments if they're an in-place operation such as .append(), which returns None and should not be assigned to anything. Also, if the if block does execute, it'll prematurely return the result without finishing work on the rest of the list.
You may be looking for something like this:
def sentence_to_words(s):
    s_new = []
    ignore = ["#", "!", ",", ".", "?", "$"]
    for word in s.split():
        cleaned_word = ""
        for letter in list(word):
            if letter not in ignore:
                cleaned_word += letter
        s_new.append(cleaned_word.lower())
    return s_new

print sentence_to_words("Mr. Stark... I don't feel so good")
Output:
['mr', 'stark', 'i', "don't", 'feel', 'so', 'good']
The approach in the above example is to iterate over words, then iterate over letters in each word to clean them according to the requirements and add the clean word to the result array. Note the descriptive variable names, which aid in understanding the program (for example, i was actually a word in your code, but i usually means integer or index).
The above example can be optimized: it uses a lot of error-prone arrays and loops, the ignore list should be a parameter to make the function reusable, and the in operator is slow on lists (ignore should be a set). Using regex makes it a one-liner:
import re
def sentence_to_words(s):
return re.sub(r"[\#\,\!\.\?\$]", "", s).lower().split()
Or using filter and the set of characters to ignore as a default parameter (note that in Python 3 filter returns an iterator, so its result has to be joined back into a string first):
def sentence_to_words(s, ignore=set("#!,.?$")):
    return "".join(filter(lambda x: x not in ignore, s)).lower().split()
Try it!
I couldn't understand your code very well, but here's an alternative using re.sub and split().
We first remove any special chars with re.sub and then use split() to get a list of words, i.e.:
import re
sentence = "Mr. Stark... I don't feel so good"
words = re.sub(r"[#,!\?\$.]", "", sentence).split()
Using re.split:
words = re.split("[^a-z'-]+", sentence, 0, re.IGNORECASE)
Both examples output:
# ['Mr', 'Stark', 'I', "don't", 'feel', 'so', 'good']
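One more alternative not shown in the answers above is str.translate (Python 3), which deletes a whole set of characters in a single pass:

```python
def sentence_to_words(s, drop="#,!.?$"):
    # str.maketrans("", "", drop) builds a table that maps every
    # character in `drop` to None, i.e. deletes it; translate applies
    # that table in one pass over the string.
    return s.translate(str.maketrans("", "", drop)).lower().split()

print(sentence_to_words("Mr. Stark... I don't feel so good"))
# ['mr', 'stark', 'i', "don't", 'feel', 'so', 'good']
```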

how to use file to re copy a string using position and words only

use a file to create the sentence
sentence = 'the cat sat on the cat mat'
indivdual_words = ['the', 'cat', 'sat', 'on', 'mat']
positions = [1, 2, 3, 4, 1, 2, 5]
f = open('word_file.txt', 'w+')
f.write(str(indivdual_words))
f.close()
f = open('pos_file.txt', 'w+')
f.write(str(positions))
f.close()
The program should read 1 as "the", 2 as "cat", and so on.
Since you're storing everything as strings, you'll end up with file contents that match a valid python expression. You can use ast.literal_eval to get the actual python object out of the string representation.
from ast import literal_eval

with open('word_file.txt') as f:
    data = f.read().strip()
    words = literal_eval(data)

with open('pos_file.txt') as f:
    data = f.read().strip()
    pos = literal_eval(data)
Then just do the opposite of what you did before.
result = " ".join([words[i-1] for i in pos])
Since you're dumping the representation of the lists, the best way is to read them back using ast.literal_eval
import ast

with open('word_file.txt') as f:
    indivdual_words = ast.literal_eval(f.read())

with open('pos_file.txt') as f:
    positions = ast.literal_eval(f.read())
then recreate the sentence using a list comprehension to generate the words in sequence, joined with spaces:
sentence = " ".join([indivdual_words[i-1] for i in positions])
result:
the cat sat on the cat mat
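Putting the write and read halves together, the whole round trip looks like this (file and variable names taken from the question):

```python
import ast

indivdual_words = ['the', 'cat', 'sat', 'on', 'mat']
positions = [1, 2, 3, 4, 1, 2, 5]

# Write the lists out as their repr, exactly as in the question.
with open('word_file.txt', 'w') as f:
    f.write(str(indivdual_words))
with open('pos_file.txt', 'w') as f:
    f.write(str(positions))

# Read them back with ast.literal_eval and rebuild the sentence.
with open('word_file.txt') as f:
    words = ast.literal_eval(f.read())
with open('pos_file.txt') as f:
    pos = ast.literal_eval(f.read())

sentence = " ".join(words[i - 1] for i in pos)
print(sentence)  # the cat sat on the cat mat
```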
After you create the readable file objects (w for the word file, n for the index file):
1) iterate through the word file object, appending each word to an empty list;
2) iterate through the index file object, look up the word at each index in the word list, and append that word plus a space to the initially empty sentence you are building.
word_list = []
for word in w:
    word_list.append(word)

sentence = ''
for index in n:
    word = word_list[index - 1]  # positions are 1-based
    sentence += word
    sentence += ' '
