How to iterate through all keys in a dict that share the same value, one by one and in sequence - python-3.x

I'm working on a text file that contains a large number of words, and I want to get all the words along with their lengths. For example, first I want all words whose length is 2, then 3, then 4, and so on up to 15. For example:
Word = this , length = 4
hate :4
love :4
that:4
china:5
Great:5
and so on up to 15
I tried the following code, but I couldn't iterate through all the keys one by one. With this code I can only get the words of length 5, but I want the loop to run in sequence from 2 up to 15.
text = open(r"C:\Users\israr\Desktop\counter\Bigdata.txt")
d = dict()
for line in text:
    line = line.strip()
    line = line.lower()
    words = line.split(" ")
    for word in words:
        if word not in d:
            d[word] = len(word)
def getKeysByValue(d, valueToFind):
    listOfKeys = list()
    listOfItems = d.items()
    for item in listOfItems:
        if item[1] == valueToFind:
            listOfKeys.append(item[0])
    return listOfKeys
listOfKeys = getKeysByValue(d, 5)
print("Keys with value equal to 5")
#Iterate over the list of keys
for key in listOfKeys:
    print(key)

What I have done is:
Changed the structure of your dictionary:
In your version of the dictionary, each word is a key whose value is its length, like this:
{"hate": 4, "love": 4}
New version:
{4: ["hate", "love"], 5: ["great", "china"]}
Now the keys are integers and the values are lists of words. For instance, if the key is 4, the value is a list of all words of length 4 from the file.
After that, the code populates the dictionary from the data read from the file. If a key is not yet present in the dictionary, it is created; otherwise the word is appended to the list stored under that key.
The keys are then sorted and their values printed, so all words of each length are printed in sequence.
You forgot to close the file in your code. It is good practice to release any resource a program uses once it is done with it (to avoid resource leaks and similar errors). Most of the time this just means closing the resource. Closing the file, for instance, releases it so that other programs can use it.
# 24-Apr-2020
# 03:11 AM (GMT +05)
# TALHA ASGHAR
# Open the file to read data from
myFile = open(r"books.txt")
# create an empty dictionary where we will store the words grouped by length
# format of data in dictionary will be:
# {1: [words from file of length 1], 2:[words from file of length 2], ..... so on }
d = dict()
# iterate over all the lines of our file
for line in myFile:
    # get words from the current line
    words = line.lower().strip().split(" ")
    # iterate over each word from the current line
    for word in words:
        # get the length of this word
        length = len(word)
        # if there is no word of this length in the dictionary yet,
        # create a list against this length
        # length is the key, and the value is the list of words with this length
        if length not in d.keys():
            d[length] = [word]
        # if there is already a word of this length append current word to that list
        else:
            d[length].append(word)
for key in sorted(d.keys()):
    print(key, end=":")
    print(d[key])
myFile.close()
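The same grouping can be written more compactly. What follows is only a minimal sketch, not the answer author's code: it assumes collections.defaultdict for the per-length lists, and the function name group_words_by_length and the sample lines are made up for illustration.

```python
from collections import defaultdict


def group_words_by_length(lines):
    """Map each word length to the list of words with that length."""
    groups = defaultdict(list)
    for line in lines:
        # split() with no argument also drops the empty strings that
        # split(" ") would produce for repeated spaces
        for word in line.lower().split():
            groups[len(word)].append(word)
    return groups


# Hypothetical sample data standing in for the lines of a text file.
sample_lines = ["I hate that", "Great China love"]
groups = group_words_by_length(sample_lines)

for length in sorted(groups):
    print(length, groups[length], sep=":")

# With a real file (the path here is an assumption), a with block
# closes the file automatically, even if an error occurs:
# with open("books.txt") as my_file:
#     groups = group_words_by_length(my_file)
```

Because a file object yields its lines when iterated, the same function should work unchanged on an open file.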

The first part of your code is correct: the dictionary d gives you all the unique words with their respective lengths.
Now you want all the words ordered by length, as shown below:
{'this':4, 'that':4, 'water':5, 'china':5, 'great':5, ... up to length 15}
To get that ordering, sort the dictionary's items by value:
import operator
sorted_d = sorted(d.items(), key=operator.itemgetter(1))
Note that sorted() returns a list of (word, length) tuples rather than a dictionary, so sorted_d will look like this:
[('this', 4), ('that', 4), ('water', 5), ('china', 5), ('great', 5), ..., ('abcdefghijklmno', 15), ...]
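As a small illustration of sorting by value, here is a sketch with a hand-made dictionary (the sample words are assumptions, not data from the question's file). It also restricts the printout to lengths 2 through 15, which is the range the question asks for.

```python
import operator

# Hypothetical word -> length dictionary, as built by the question's first loop.
d = {"china": 5, "hate": 4, "great": 5, "love": 4, "to": 2}

# sorted() returns a list of (word, length) tuples ordered by length;
# the sort is stable, so words of equal length keep their insertion order.
sorted_d = sorted(d.items(), key=operator.itemgetter(1))

for word, length in sorted_d:
    if 2 <= length <= 15:
        print(f"{word}:{length}")
```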

Related

What's the best way to search a text file for consecutive repetitions and return the text with highest number of them?

I'm extremely new to programming in general and have only been learning Python for 1 week.
For a class, I have to analyze a text DNA sequence, something like this:
CTAGATAGATAGATAGATAGATGACTA
for these specific keys: AGAT,AATG,TATC
I have to keep track of the largest number of consecutive repetitions for each, disregarding all but the highest number of repetitions.
I've been poring over previous Stack Overflow answers and I saw groupby() suggested as a way to do this. I'm not exactly sure how to use groupby for my specific needs, though.
It seems like I will have to read the text sequence from a file into a list. Can I import what is essentially a text string into a list? Do I have to separate all of the characters by commas? Will groupby work on a string?
It also looks like groupby would give me the highest incidence of consecutive repetitions, but in the form of a list. How would I get the highest result out of that list to then be stored somewhere else, without me, the programmer, having to look at the result? Will groupby return the highest number of consecutive repeats first in the list? Or will it be placed in order of when it occurred in the list?
Is there a function I can use to isolate and return the sequence with the highest repetition incidence, so that I can compare that with the dictionary file I've been provided with?
Frankly, I really could use some help breaking down the groupby function in general.
My assignment recommended possibly using a slice to accomplish this, and that seemed somehow more daunting to try, but if that's the way to go, please let me know, and I wouldn't turn down a nudge in the right direction on how in the heck to do that.
Thank you in advance for any and all wisdom on this.
Here's a similar solution to the previous post, but may have better readability.
# The DNA Sequence
DNA = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
# All Sequences of Interest
elements = {"AGAT", "AATG", "TATC"}
# Add Elements to A Dictionary
maxSeq = {}
for element in elements:
    maxSeq[element] = 0
# Find Max Sequence for Each Element
for element in elements:
    i = 0
    curCount = 0
    # Ensure DNA Length Not Reached
    while i+4 <= len(DNA):
        # Sequence Not Being Tracked
        if curCount == 0:
            # Sequence Found
            if DNA[i: i + 4] == element:
                curCount = 1
                i += 4
            # Sequence Not Found
            else: i += 1
        # Sequence Is Being Tracked
        else:
            # Sequence Found
            if DNA[i: i + 4] == element:
                curCount += 1
                i += 4
            # Sequence Not Found
            else:
                # Check If Previous Max Was Beat
                if curCount > maxSeq[element]:
                    maxSeq[element] = curCount
                # Reset Count
                curCount = 0
                i += 1
    #Check If Sequence Was Being Tracked At End
    if curCount > maxSeq[element]: maxSeq[element] = curCount
#Display
print(maxSeq)
Output:
{'AGAT': 5, 'TATC': 0, 'AATG': 0}
This doesn't seem like a groupby problem since you want multiple groups of the same key. It would be easier to just scan the list for key counts.
# all keys (keys are four chars each)
seq = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
# split key string into list of keys: ["CTAG","ATAG","ATAG","ATAG", ....]
lst = [seq[i:i+4] for i in (range(0,len(seq),4))]
lst.append('X') # the while loop only tallies when next key found, so add fake end key
# these are the keys we care about and want to store the max consecutive counts
dicMax = { 'AGAT':0, 'AATG':0, 'TATC':0, 'ATAG':0 } #dictionary of keys and max consecutive key count
# the while loop starts at the 2nd entry, so set variables based on first entry
cnt = 1
key = lst[0] #first key in list
if (key in dicMax): dicMax[key] = 1 #store first key in case it's the max for this key
ctr = 1 # start at second entry in key list (we always compare to previous entry so can't start at 0)
while ctr < len(lst): #all keys in list
    if (lst[ctr] != lst[ctr-1]): #if this key is different from previous key in list
        if (key in dicMax and cnt > dicMax[key]): #if we care about this key and current count is larger than stored count
            dicMax[key] = cnt #store current count as max count for this key
        #set variables for next key in list
        cnt = 0
        key = lst[ctr]
    ctr += 1 #list counter
    cnt += 1 #counter for current key
print(dicMax) # max consecutive count for each key
Raiyan Chowdhury suggested that the sequences may overlap, so dividing the base sequence into four character strings may not work. In this case, we need to search for each string individually.
Note that this algorithm is not efficient, but readable to a new programmer.
seq = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
dicMax = { 'AGAT':0, 'AATG':0, 'TATC':0, 'ATAG':0 } #dictionary of keys and max consecutive key count
for key in dicMax: #each key, could divide and conquer here so all keys run at same time
    for ctr in range(1,9999): #keep adding key to itself ABC > ABCABC > ABCABCABC
        s = key * ctr #create string by repeating key "ABC" * 2 = "ABCABC"
        if (s in seq): # if repeated key found in full sequence
            dicMax[key]=ctr # set max (repeat) count for this key
        else:
            break # exit inner for #done with this key
print(dicMax) #max consecutive key counts
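Another way to find the longest run, offered only as a sketch and not taken from any of the answers above, is a regular expression from the standard re module. The helper name longest_run is made up for illustration, and the sequence is the short example from the question. The lookahead makes the search try every start position, so a long run is not hidden by a shorter match that overlaps its beginning.

```python
import re


def longest_run(seq, key):
    """Return the largest number of back-to-back copies of key in seq."""
    # (?:key)+ greedily matches one or more consecutive copies of key.
    # Wrapping it in a lookahead with a capture group lets a run start
    # at every position instead of only after the previous match.
    pattern = re.compile("(?=((?:" + re.escape(key) + ")+))")
    longest = max((len(m.group(1)) for m in pattern.finditer(seq)), default=0)
    return longest // len(key)


seq = "CTAGATAGATAGATAGATAGATGACTA"
result = {key: longest_run(seq, key) for key in ("AGAT", "AATG", "TATC")}
print(result)
```

For very long sequences this is not the fastest option, since a run is re-scanned from each position inside it, but it keeps the logic to a few lines.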

I am not able to understand the code. Can anyone help me out?

marks = {}
for _ in range(int(input())):
    line = input().split()
    marks[line[0]] = list(map(float, line[1:]))
print('%.2f' %(sum(marks[input()])/3))
I am new to python. Can you tell me the meaning of this code?
I'm not able to understand it.
What this code does:
# initialize a dictionary named marks
marks = {}
# input() pauses and waits for someone to type a line on the command line
# range() produces a sequence of integers up to (but not including) the given number
# example: range(5) yields 0, 1, 2, 3, 4
# In this case the string returned from input() is converted to an integer
# and used as that number.
# The for loop runs once for each element "in" that sequence
# the _ is just a conventional throwaway variable name, used because
# the loop value itself is not needed anywhere.
for _ in range(int(input())):
    # Get a new input from the user
    # split the string (it uses whitespace to cut the string into a list)
    # example: if you type "one two three" it will create ["one", "two", "three"]
    # store the list in the variable line
    line = input().split()
    # add/replace the entry using the first string in the line as the key
    # line[0] is the first element of the list
    # line[1:] is the list of all elements starting at index 1 (the second element)
    # map() calls float on each element of that list, lazily producing float(line[1]), float(line[2]), ...
    # list() collects those values into a list.
    marks[line[0]] = list(map(float, line[1:]))
# this last line asks the user for one more value
# gets the list in the marks dictionary using the value entered by the user
# calculates the sum of all the floats in that list,
# divides it by 3 and prints the result as a floating point number with 2 decimal places.
print('%.2f' %(sum(marks[input()])/3))
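To see the same steps without typing anything at a prompt, here is a sketch that takes the lines as a list instead of calling input(). The function name average_for and the sample records are assumptions made for illustration. Unlike the original, it divides by the number of marks rather than a hard-coded 3.

```python
def average_for(records, name):
    """Build the name -> marks dictionary and return the average for name."""
    marks = {}
    for record in records:
        fields = record.split()
        # first field is the name, the rest are the marks as floats
        marks[fields[0]] = list(map(float, fields[1:]))
    scores = marks[name]
    return sum(scores) / len(scores)


# Hypothetical input lines: a name followed by three marks.
records = ["Krishna 67 68 69", "Arjun 70 98 63", "Malika 52 56 60"]
average = average_for(records, "Malika")
print('%.2f' % average)
```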

sorting a list alphabetically with numerical counting

I need to sort a list in which repeated words are counted and ordered by that count: the words with the most occurrences come first, and ties are broken alphabetically. I am reading a file, collecting the first word of each line, and counting how often each word is repeated so that the words can be printed in that order. For example, if apple has the most repeats, it will appear at the front of the printed list. Below is my attempt at this:
x = open("file.txt", "r")
li1 = []
for line in x:
    user = line.split()
    name = user[0]
    li1.append(name)
li2 = [[y,li1.count(y)] for y in set(li1)]
print(li2)
x.close()
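One possible way to get the ordering described above, sketched under assumptions rather than given as a definitive answer: count the names with collections.Counter, then sort on a key that puts the count first (negated, so larger counts come first) and the word second. The names list below is hypothetical and stands in for the first words read from the file.

```python
from collections import Counter

# Hypothetical first words collected from each line of the file.
names = ["pear", "apple", "fig", "apple", "pear", "apple", "fig"]

counts = Counter(names)

# Sort by descending count, then alphabetically for equal counts.
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ordered)
```

Counter makes a single pass over the list, whereas calling li1.count(y) for every distinct word rescans the whole list each time.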

Arrange the string in every possible correct alphabetical sequence of three characters

Write a Python program to arrange the string into every possible
correct alphabetical sequence of three characters.
For example:
INPUT : "ahdgbice"
OUTPUT: {'abc', 'bcd', 'ghi', 'cde'}
Can anyone suggest an optimised method to do this? I have tried and succeeded in generating the output, but I am not satisfied with my code, so please suggest a properly optimised way to solve this problem.
This is probably a decent result:
>>> import itertools as it
>>> in_s="ahdgbice"
>>> in_test=''.join([chr(e) for e in range(ord(min(in_s)),ord(max(in_s))+1)])
>>> {s for s in map(lambda e: ''.join(e), (it.combinations(sorted(in_s),3))) if s in in_test}
{'abc', 'ghi', 'bcd', 'cde'}
How it works:
Generate a string that runs from the smallest to the largest letter of the input (abcdefghi in this case) to test whether a substring is in consecutive alphabetical order: in_test=''.join([chr(e) for e in range(ord(min(in_s)),ord(max(in_s))+1)])
Generate every combination of 3 letter substrings from a sorted in_s with map(lambda e: ''.join(e), (it.combinations(sorted(in_s),3)))
Test whether each combination is a consecutive alphabetical run by checking that it is a substring of abcd..[max letter of in_s]
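Since only runs of consecutive letters qualify, generating every combination is not strictly necessary. The following sketch, which is an alternative and not the answer author's method, sorts the distinct letters once and slides a window of three over them. The function name consecutive_triples is made up for illustration.

```python
def consecutive_triples(text):
    """Return all 3-letter runs of consecutive letters present in text."""
    letters = sorted(set(text))
    triples = set()
    for i in range(len(letters) - 2):
        a, b, c = letters[i:i + 3]
        # keep the window only if each letter directly follows the previous one
        if ord(b) - ord(a) == 1 and ord(c) - ord(b) == 1:
            triples.add(a + b + c)
    return triples


print(consecutive_triples("ahdgbice"))
```

This does one sort plus a single pass, instead of examining every three-letter combination.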
Solution: This is not an optimised solution, but it fulfils the requirement.
# for using array import numpy lib
import numpy as np
#input string
str_1="ahdgbice"
#breaking the string into characters by putting it into a list.
list_1=list(str_1)
# for sorting we copy that list value in an array
arr_1=np.array(list_1)
arr_2=np.sort(arr_1)
# some temp variables
previous=0
str_2=""
list_2=list()
#logic and loops starts here : looping outer loop from 0 to length of sorted array
for outer in range(0,len(arr_2)):
    #looping inner loop from outer index value to length of sorted array
    for inner in range(outer,len(arr_2)):
        value=arr_2[inner]
        #ord() returns the code point (ASCII value) of a character
        # compare numbers with ==, not "is" ("is" tests object identity)
        if(previous == 0):
            previous=ord(value)
        #difference between two consecutive sequence is always 1 or -1
        # e.g ascii of a= 97, b=98 ,So a-b=-1 or b-a=1 and used abs() to return absolute value
        if(abs(previous-ord(value)) == 1):
            str_2=str_2+value # appending character with previous str_2 values
            previous=ord(value) # storing current character's ascii value to previous
        else:
            str_2=value # assigning character value to str_2
            previous=ord(value) # storing current character's ascii value to previous
        # for making a string of three characters
        if(len(str_2) == 3):
            list_2.append(str_2)
# Logic and loops ends here
# put into the set to remove duplicate values
set_1=set(list_2)
#printing final output
print(set_1)
Output:
{'abc', 'bcd', 'ghi', 'cde'}
I would use the itertools module's permutations function to get all three-element permutations of your input, and then for each result see if it is identical to a sorted version of itself.

Create a dictionary from a file

I am writing code that lets the user input a .txt file of their choice. So, for example, if the text read:
"I am you. You ArE I."
I would like my code to create a dictionary that resembles this:
{I: 2, am: 1, you: 2, are: 1}
Having the words in the file appear as the key, and the number of times as the value. Capitalization should be irrelevant, so are = ARE = ArE = arE = etc...
This is my code so far. Any suggestions/help?
file = input("\n Please select a file")
name = open(file, 'r')
dictionary = {}
with name:
    for line in name:
        (key, val) = line.split()
        dictionary[int(key)] = val
Take a look at the examples in this answer:
Python : List of dict, if exists increment a dict value, if not append a new dict
You can use collections.Counter() to trivially do what you want, but if for some reason you can't use that, you can use a defaultdict or even a simple loop to build the dictionary you want.
Here is code that solves your problem. This will work in Python 3.1 and newer.
from collections import Counter
import string
def filter_punctuation(s):
    return ''.join(ch if ch not in string.punctuation else ' ' for ch in s)
def lower_case_words(f):
    for line in f:
        line = filter_punctuation(line)
        for word in line.split():
            yield word.lower()
def count_key(tup):
    """
    key function to make a count dictionary sort into descending order
    by count, then case-insensitive word order when counts are the same.
    tup must be a tuple in the form: (word, count)
    """
    word, count = tup
    return (-count, word.lower())
dictionary = {}
fname = input("\nPlease enter a file name: ")
with open(fname, "rt") as f:
    dictionary = Counter(lower_case_words(f))
print(sorted(dictionary.items(), key=count_key))
From your example I could see that you wanted punctuation stripped away. Since we are going to split the string on white space, I wrote a function that filters punctuation to white space. That way, if you have a string like hello,world this will be split into the words hello and world when we split on white space.
The function lower_case_words() is a generator, and it reads an input file one line at a time and then yields up one word at a time from each line. This neatly puts our input processing into a tidy "black box" and later we can simply call Counter(lower_case_words(f)) and it does the right thing for us.
Of course you don't have to print the dictionary sorted, but I think it looks better this way. I made the sort order put the highest counts first, and where counts are equal, put the words in alphabetical order.
With your suggested input, this is the resulting output:
[('i', 2), ('you', 2), ('am', 1), ('are', 1)]
Because of the sorting it always prints in the above order.
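For comparison, the "simple loop" alternative mentioned earlier in this answer might look like the sketch below. It is an illustration under assumptions, not the answer author's code: the function name count_words is hypothetical, the sample line is the one from the question, and punctuation is only stripped from the ends of each word.

```python
import string


def count_words(lines):
    """Count case-insensitive words using a plain dictionary."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            # remove punctuation attached to either end of the word
            word = word.strip(string.punctuation)
            if word:
                counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words(["I am you. You ArE I."])
print(counts)
```

Note that stripping only the ends leaves a token such as hello,world as one word, which is a case the filter_punctuation() approach above handles by replacing punctuation with spaces.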
