Add filenames to multiple for loops - python-3.x

I have a list of file names, like this.
file_names = ['file1', 'file2']
I also have lists of keywords that I am trying to extract from some files. The keyword lists (list_1, list_2) and the text strings that come from file1 and file2 are below:
## list_1 keywords
list_1 = ['hi', 'hello']
## list_2 keywords
list_2 = ['I', 'am']
## Text strings from file_1 and file_2
big_list = ['hi I am so and so how are you', 'hello hope all goes well by the way I can help you']
My function to extract text,
import re

def my_func(text_string, key_words):
    # split the text into sentences that end with a period
    sentences = re.findall(r"([^.]*\.)", text_string)
    for sentence in sentences:
        if all(word in sentence for word in key_words):
            return sentence
Now, I am going through multiple lists with two different for loops (as shown below) and with the function. At the end of each iteration of these for loops, I want to save the file with the file names from the file_names list.
for a,b in zip(list_1,list_2):
    for item in big_list:
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        ## Here I would like to add the file name i.e (print(filename))
        print(sentence_1)
        print(sentence_2)
I need an output that looks like this,
file1 is:
None
file2 is:
None
You can ignore None in my output for now, as my main focus is to iterate through the file name list and add the names to my output. I would appreciate any help to achieve this.

You can access the index in Python for loops and use this index to find the file to which the string corresponds. With this you can print out the current file.
Here is an example of how you can do it:
for a,b in zip(list_1,list_2):
    # idx is the index here
    for idx, item in enumerate(big_list):
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        prefix = file_names[idx] + " is: " # Use idx to get the file from the file list
        if sentence_1 is not None:
            print(prefix + sentence_1)
        if sentence_2 is not None:
            print(prefix + sentence_2)
Update:
If you want to print the results after the iteration, you can temporarily save the results in a dictionary and then loop through it:
for a,b in zip(list_1,list_2):
    resMap = {}
    # idx is the index here
    for idx, item in enumerate(big_list):
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        if sentence_1 is not None:
            resMap[file_names[idx]] = sentence_1
        if sentence_2 is not None:
            resMap[file_names[idx]] = sentence_2 # replaces sentence_1 for the same file
    for k in resMap.keys():
        prefix = k + " is: " # k is the file name
        print(prefix + resMap[k])
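As a variation on the enumerate approach above, the file names and texts can be paired directly with zip. This is only a minimal sketch: find_sentence and label_matches are hypothetical names, and the sample texts are made up with periods added, because the regex only matches period-terminated sentences (the strings in the question contain no periods, which is why None is printed there).

```python
import re

def find_sentence(text, key_words):
    """Return the first period-terminated sentence containing every keyword, else None."""
    for sentence in re.findall(r"[^.]*\.", text):
        if all(word in sentence for word in key_words):
            return sentence
    return None

def label_matches(file_names, texts, key_words):
    """Pair each file name with the sentence found in the text at the same position."""
    return [(name, find_sentence(text, key_words))
            for name, text in zip(file_names, texts)]

# Illustrative data only (not the strings from the question).
file_names = ["file1", "file2"]
texts = ["hi I am here. how are you.", "hello there. I can help you."]

results = label_matches(file_names, texts, ["hi"])
for name, sentence in results:
    print(f"{name} is: {sentence}")
```

zip stops at the shorter of the two lists, so this assumes file_names and the list of texts have the same length and order.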

Related

I want to make a dictionary of trigrams out of a text file, but something is wrong and I do not know what it is

I have written a program that counts the trigrams occurring 5 times or more in a text file. The trigrams should be printed out according to their frequency.
I cannot find the problem!
I get the following error message:
list index out of range
I tried making the range bigger, but that did not help.
f = open("bsp_file.txt", encoding="utf-8")
text = f.read()
f.close()
words = []
for word in text.split():
    word = word.strip(",.:;-?!-–—_ ")
    if len(word) != 0:
        words.append(word)
trigrams = {}
for i in range(len(words)):
    word = words[i]
    nextword = words[i + 1]
    nextnextword = words[i + 2]
    key = (word, nextword, nextnextword)
    trigrams[key] = trigrams.get(key, 0) + 1
l = list(trigrams.items())
l.sort(key=lambda x: x[1])
l.reverse()
for key, count in l:
    if count < 5:
        break
    word = key[0]
    nextword = key[1]
    nextnextword = key[2]
    print(word, nextword, nextnextword, count)
The result should look like this (simplified):
s = "this is a trigram which is an example............."
this is a
is a trigram
a trigram which
trigram which is
which is an
is an example
As the comments pointed out, you iterate over your list words with i and access words[i + 1] and words[i + 2]. When i reaches the last two cells of words, those indices are out of range.
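The smallest change that avoids the error is to stop the loop two positions before the end of the list. A minimal sketch, wrapping the loop from the question in a hypothetical function count_trigrams:

```python
def count_trigrams(words):
    """Count every run of three consecutive words."""
    trigrams = {}
    # The last valid start index is len(words) - 3, so range stops at len(words) - 2.
    for i in range(len(words) - 2):
        key = (words[i], words[i + 1], words[i + 2])
        trigrams[key] = trigrams.get(key, 0) + 1
    return trigrams

counts = count_trigrams("this is a trigram this is a".split())
print(counts[("this", "is", "a")])
```

For lists with fewer than three words the range is empty, so the function returns an empty dictionary instead of raising an error.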
I suggest you read this tutorial to generate n-grams with pure python: http://www.albertauyeung.com/post/generating-ngrams-python/
Answer
If you don't have much time to read it all, here's the function I recommend, adapted from the link:
def get_ngrams_count(words, n):
    # generates an iterator of tuples representing all n-grams
    ngrams_tuple = zip(*[words[i:] for i in range(n)])
    # turn the n-grams into a dictionary with the counts of all ngrams
    ngrams_count = {}
    for ngram in ngrams_tuple:
        if ngram not in ngrams_count:
            ngrams_count[ngram] = 0
        ngrams_count[ngram] += 1
    return ngrams_count
trigrams = get_ngrams_count(words, 3)
Please note that you can make this function a lot simpler by using a Counter (which subclasses dict, so it will be compatible with your code) :
from collections import Counter
def get_ngrams_count(words, n):
    # turn the list into a dictionary with the counts of all ngrams
    return Counter(zip(*[words[i:] for i in range(n)]))
trigrams = get_ngrams_count(words, 3)
Side Notes
You can use the bool argument reverse in .sort() to sort your list from most common to least common:
l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)
This is a tad faster than sorting your list in ascending order and then reversing it with .reverse().
A more generic function for the printing of your sorted list (will work for any n-grams and not just tri-grams):
for ngram, count in l:
    if count < 5:
        break
    # " ".join(ngram) will combine all elements of ngram in a string, separated with spaces
    print(" ".join(ngram), count)
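If the Counter version is used, the manual sort can also be dropped, because Counter.most_common() already returns the items ordered from most to least frequent. A small sketch under that assumption; frequent_ngrams and the sample words are made up for illustration:

```python
from collections import Counter

def frequent_ngrams(words, n, min_count):
    """Return (ngram, count) pairs with count >= min_count, most frequent first."""
    counts = Counter(zip(*(words[i:] for i in range(n))))
    return [(ngram, count)
            for ngram, count in counts.most_common()
            if count >= min_count]

sample = "x y z x y z x y z q".split()
for ngram, count in frequent_ngrams(sample, 3, 3):
    print(" ".join(ngram), count)
```

With the data from the question, the call would be frequent_ngrams(words, 3, 5).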

Never resets list

I am trying to create a calorie counter. The standard input goes like this:
python3 calories.txt < test.txt
Inside the calories file, each food is in the following format: apples 500
The problem I am having is that whenever I calculate the values for a person, the list never seems to return to an empty list.
import sys
food = {}
eaten = {}
finished = {}
total = 0
#mappings
def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1]
            food[key] = value
def calculate(x):
    a = []
    for keys,values in x.items():
        for c in values:
            try:
                a.append(int(food[c]))
            except:
                a.append(100)
    print("before",a)
    a = []
    total = sum(a) # Problem here
    print("after",a)
    print(total)
def main():
    calories(sys.argv[1])
    for line in sys.stdin:
        lines = line.strip().split(',')
        for c in lines:
            values = lines[0]
            keys = lines[1:]
            eaten[values] = keys
        calculate(eaten)
if __name__ == '__main__':
    main()
Edit - forgot to include what test.txt would look like:
joe,almonds,almonds,blue cheese,cabbage,mayonnaise,cherry pie,cola
mary,apple pie,avocado,broccoli,butter,danish pastry,lettuce,apple
sandy,zuchini,yogurt,veal,tuna,taco,pumpkin pie,macadamia nuts,brazil nuts
trudy,waffles,waffles,waffles,chicken noodle soup,chocolate chip cookie
How to make it easier on yourself:
When reading the calorie data, convert the calories to int() as soon as possible, so there is no need to do it every time you want to sum something up.
A dictionary has a .get(key, defaultvalue) accessor, so using 100 as the default when a food is not found is a one-liner without try: ... except:
This works for me. Instead of using sys.stdin and piping the second file into the program with <, I pass it as a second file argument.
I modified some of the parsing to remove whitespace, and calculate now returns a list of (name, cal) tuples.
May it help you to fix it to your liking:
import sys

food = {}   # food name -> calories
eaten = {}  # person -> list of foods eaten

def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1].strip() # ensure no whitespace in the value
            food[key] = int(value)    # convert once, while reading

def getCal(foodlist, defValueUnknown = 100):
    """Get the total calories of a list of foods; unknown foods count as 100."""
    return sum(food.get(x, defValueUnknown) for x in foodlist)

def calculate(x):
    a = []
    for name,foods in x.items():
        a.append((name, getCal(foods))) # one (name, total) tuple per person
    return a

def main():
    calories(sys.argv[1])
    with open(sys.argv[2]) as f: # parse as file, not piped in via sys.stdin
        for line in f:
            lines = line.strip().split(',')
            values = lines[0].strip()
            keys = [x.strip() for x in lines[1:]] # ensure no whitespace in names
            eaten[values] = keys
    calced = calculate(eaten) # calculate after all are read into the dict
    print (calced)

if __name__ == '__main__':
    main()
Output:
[('joe', 1400), ('mary', 1400), ('sandy', 1600), ('trudy', 1000)]
Using sys.stdin and piping just led to my console blinking and waiting for manual input, which may be VS related.
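If the piped-input interface from the question has to stay, one option is to compute each person's total inside a function that takes any iterable of lines, so nothing accumulates in global state between people. This is a sketch only: load_calories and totals_per_person are hypothetical names, and the calorie values other than "apples 500" are invented for the demonstration.

```python
def load_calories(lines):
    """Parse lines such as 'blue cheese 350' into a {food: calories} dict."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[" ".join(parts[:-1])] = int(parts[-1])
    return table

def totals_per_person(lines, table, default=100):
    """Return one (name, total) tuple per 'name,food,food,...' line."""
    result = []
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        if not fields[0]:
            continue  # skip blank lines
        name, foods = fields[0], fields[1:]
        # The sum is built fresh for each person; unknown foods count as `default`.
        result.append((name, sum(table.get(food, default) for food in foods)))
    return result

# In-memory demonstration with made-up calorie values.
table = load_calories(["apples 500", "blue cheese 350"])
totals = totals_per_person(["joe,apples,blue cheese,cola", "mary,apples"], table)
print(totals)
```

Because both functions accept any iterable of strings, an open file can be passed to load_calories and sys.stdin to totals_per_person when the data is piped in with <.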

how to update contents of file in python

def update():
    global mylist
    i = j = 0
    mylist[:]= []
    key = input("enter student's tp")
    myf = open("data.txt","r+")
    ml = myf.readlines()
    #print(ml[1])
    for line in ml:
        words = line.split()
        mylist.append(words)
    print(mylist)
    l = len(mylist)
    w = len(words)
    print(w)
    print(l)
    for i in range(l):
        for j in range(w):
            print(mylist[i][j])
            ## if(key == mylist[i][j]):
            ##     print("found at ",i,j)
            ##     del mylist[i][j]
            ##     mylist[i].insert((j+1), "xxx")
Below is the error:
print(mylist[i][j])
IndexError: list index out of range
I am trying to update the contents of a file. I am saving the file in a list as lines, and each line is then saved as another list of words. So "mylist" is a 2D list, but it is giving me an index error.
Your w variable is the length of the last line's word list. Other lines could be shorter.
A better idiom is to use a for loop to iterate over a list.
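As a rough illustration of that idiom, the nested lists can be walked directly, with no index arithmetic, so rows of different lengths cause no trouble. The function replace_word and the sample rows are made up for this sketch:

```python
def replace_word(rows, key, replacement="xxx"):
    """Return a copy of rows in which every word equal to key is replaced."""
    return [[replacement if word == key else word for word in row]
            for row in rows]

# Rows of different lengths, as produced by line.split() on each line of a file.
rows = [["tp1", "alice", "90"], ["tp2", "bob"]]
updated = replace_word(rows, "tp2")
for row in updated:
    print(" ".join(row))
```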
But there is an even better way.
It appears you want to replace a "tp" (whatever that is) with the string xxx everywhere. A quicker way to do that would be to use regular expressions.
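import re
key = input("enter student's tp")
with open('data.txt') as myf:
    myd = myf.read()
# re.escape makes the key match literally, even if it contains regex metacharacters
newd = re.sub(re.escape(key), 'xxx', myd)
with open('newdata.txt', 'w') as newf:
    newf.write(newd)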

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function to count the number of each sequence ID that occurs in this file (it's a sample blast file)
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which just returns empty brackets {} when called.
I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
    dic1={}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line.startswith('#'):
                line = line.split()
                seq_id = line[1]
                if seq_id not in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1
print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict
d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
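A similar result can be had with collections.Counter. The sketch below assumes, as above, that the sequence ID is the second whitespace-separated column and that comment lines start with #. The function name count_seq_ids and the sample lines are invented for illustration (only the ID FO203510.1 comes from the question), and blank or one-column lines are skipped rather than raising an IndexError.

```python
from collections import Counter

def count_seq_ids(lines):
    """Count how often each sequence ID (second column) occurs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines, one-column lines, and '#' comment lines.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counts[fields[1]] += 1
    return counts

# Made-up sample in the general shape of tabular BLAST output.
sample_lines = [
    "# BLASTN",
    "query1\tFO203510.1\t98.5",
    "query1\tFO203510.1\t97.0",
    "query1\tAB000001.1\t90.0",
    "",
]
seq_counts = count_seq_ids(sample_lines)
print(seq_counts.most_common())
```

An open file object can be passed in place of the list, since the function only iterates over lines.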

How to merge two lists at a delimited token in python3

I am a CS major at the University of Alabama. We have a project in our Python class and I am stuck, probably for some stupid reason, but I can't seem to find the answer.
Here is the link to the project, as it would be a pain to try to explain it on here.
http://beastie.cs.ua.edu/cs150/projects/project1.html
Here is my code:
import sys
from scanner import scan
def clInput():
    #Gets command line input
    log1 = sys.argv[1]
    log2 = sys.argv[2]
    name = sys.argv[3]
    if len(sys.argv) != 4:
        print('Incorrect number of arguments, should be 3')
        sys.exit(1)
    return log1,log2,name
def openFiles(log1,log2):
    #Opens sys.argv[1]&[2] for reading
    f1 = open(log1, 'r')
    f2 = open(log2, 'r')
    return f1, f2
def merge(log1,log2):
    #Merges parsed logs into list without '---'
    log1Parse = [[]]
    log2Parse = [[]]
    log1Count = 0
    log2Count = 0
    for i in log1:
        if i != ['---']:
            log1Parse[log1Count].append(i)
        else:
            log1Count += 1
            log1Parse.append([])
    for i in log2:
        if i != ['---']:
            log2Parse[log2Count].append(i)
        else:
            log2Count += 1
            log2Parse.append([])
    return(log1Parse[0] + log2Parse[0] + log1Parse[1] + log2Parse[1])
def searchMerge(name,merged):
    #Searches Merged list for sys.argv[3]
    for i in range(len(merged)):
        if (merged[i][1] == name):
            print(merged[i][0],merged[i][1]," ".join(merged[i][2:]))
def main():
    log1,log2,name = clInput()
    f1,f2 = openFiles(log1,log2)
    #Sets the contents of the two scanned files to variables
    tokens1 = scan(f1)
    tokens2 = scan(f2)
    #Call to merge and search
    merged = merge(tokens1,tokens2)
    searchMerge(name,merged)
main()
OK, so here's the problem. We are to merge two lists together into a sorted master list, delimited at the ---'s.
My two log files match the ones posted on the website I linked to above. This code works, but if there are more than two instances of the ---'s in each list, it will not jump to the next list to get the other tokens, and so forth. I have it working for two with the merge function. At the end of that function I return
return(log1Parse[0] + log2Parse[0] + log1Parse[1] + log2Parse[1])
but this only works for two instances of ---. Is there any way I can change my return to look at all of the indexes instead of having to manually put in [0],[1],[2], etc.? I need it to delimit and merge for an arbitrary number. Please help!
P.S. Disregard the noobness. I'm a novice; we all gotta start somewhere.
P.P.S. The from scanner import scan line imports a scanner I wrote to take in all of the tokens in a given list.
so.py:
import sys
def main():
    # check and load command line arguments
    if len(sys.argv) != 4:
        print('Incorrect number of arguments, should be 3')
        sys.exit(1)
    log1, log2, name = sys.argv[1:4]
    # open files using file io
    f1 = open(log1, 'r')
    f2 = open(log2, 'r')
    # list comprehension to process and filter log files
    l1 = [ x.strip().split(" ",2) for x in f1.readlines() if x.strip() != "---" ]
    l2 = [ x.strip().split(" ",2) for x in f2.readlines() if x.strip() != "---" ]
    f1.close()
    f2.close()
    # lists compare element by element, so this orders entries by their first field
    sorted_merged_lists = sorted(l1 + l2)
    results = [ x for x in sorted_merged_lists if x[1] == name ]
    for result in results:
        print(result)
main()
CLI:
$ python so.py log1.txt log2.txt Matt
['12:06:12', 'Matt', 'Logged In']
['13:30:07', 'Matt', 'Opened Terminal']
['15:02:00', 'Matt', 'Opened Evolution']
['15:31:16', 'Matt', 'Logged Out']
docs:
http://docs.python.org/release/3.0.1/tutorial/datastructures.html#list-comprehensions
http://docs.python.org/release/3.0.1/library/stdtypes.html?highlight=strip#str.strip
http://docs.python.org/release/3.0.1/library/stdtypes.html?highlight=split#str.split
http://docs.python.org/release/3.0.1/library/functions.html?highlight=sorted#sorted
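The answer above sidesteps the delimiters by sorting everything. If the segment-by-segment merge from the question is wanted instead, for any number of --- delimiters, one possible sketch uses itertools.zip_longest to pair up the segments. It assumes, as in the question's merge function, that each parsed line is a list of tokens and that a delimiter line is the list ['---']. The names split_segments and merge_segments are hypothetical.

```python
from itertools import zip_longest

DELIMITER = ["---"]  # assumed form of a parsed delimiter line

def split_segments(entries):
    """Split a list of parsed log lines into segments at each delimiter line."""
    segments = [[]]
    for entry in entries:
        if entry == DELIMITER:
            segments.append([])  # start a new segment
        else:
            segments[-1].append(entry)
    return segments

def merge_segments(log1, log2):
    """Concatenate segment 0 of both logs, then segment 1 of both, and so on."""
    merged = []
    pairs = zip_longest(split_segments(log1), split_segments(log2), fillvalue=[])
    for seg1, seg2 in pairs:
        merged.extend(seg1)
        merged.extend(seg2)
    return merged

# Made-up token lists: log1 has three segments, log2 has two.
log1 = [["a1"], ["---"], ["a2"], ["---"], ["a3"]]
log2 = [["b1"], ["---"], ["b2"]]
merged = merge_segments(log1, log2)
print(merged)
```

zip_longest pads the shorter log with empty segments, so the two logs do not need the same number of delimiters.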

Resources