tab character is not identified while parsing - python-3.x

I have strings that each contain 2 tab characters, like this:
# File contains multiple lines like this
'T1 Original 210 227 Extra Mile'
'T8 Modified 1646 1655 Tickets'
# Eg: "Tx" "indication" "start_index" "end_index" "word"
# 'T1\tOriginal 210 227\tExtra Mile'
I want the word after the second tab, so I am trying to find the indices of '\t' and replace everything before the last one with an empty string.
def find_index(s, ch):
    return [i for i, ltr in enumerate(s) if ltr == ch]
def extract_words(filename):
    extracted_data = [line.rstrip('\n') for line in open(filename)]
    search_key = '\t'
    for i in range(len(extracted_data)):
        indices = find_index(extracted_data[i], search_key)
        extracted_data[i] = extracted_data[i].replace(extracted_data[i][:indices[-1]], '')
    return extracted_data
But it does not identify the '\t': the indices output is [].
What is causing the problem?
The expected output:
'Extra Mile'
'Tickets'

Some of your lines do not contain tabs - hence no indexes, hence IndexError.
Use:
if len(indices)>1: # only extract by slicing if indexes found!
to check for that.
Why so complex? Use str.split("\t"):
def extract_words(filename):
    with open(filename) as f:
        lines = [x.strip() for x in f.readlines()]
    k = []
    for l in lines:
        try:
            k.append(l.split("\t")[2])
        except IndexError:
            print (f"no 2 tabs in '{l}'")
    return k
t = """T1\tOriginal 210 227\tExtra Mile
T8\tModified 1646 1655\tTickets
Error\ttext"""
fn = "t.txt"
with open(fn,"w") as f:
    f.write(t)
print(*extract_words(fn), sep="\n")
Output:
no 2 tabs in 'Error text'
Extra Mile
Tickets
This will work on lines with 2 tabs and report any that do not have those.
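If only the text after the last tab is needed, a variant based on str.rpartition may be simpler. The following is a minimal sketch, not the answer's method: the helper name last_field and the sample lines are made up for illustration. It returns None when a line contains no tab at all, which makes it easy to spot lines that hold spaces instead of real tab characters.

```python
def last_field(line, sep="\t"):
    """Return the text after the last `sep`, or None if `sep` is absent."""
    head, found, tail = line.rstrip("\n").rpartition(sep)
    # rpartition returns an empty separator when `sep` was not found
    return tail if found else None

# Hypothetical sample lines; the third one has no tab character.
lines = [
    "T1\tOriginal 210 227\tExtra Mile",
    "T8\tModified 1646 1655\tTickets",
    "no tabs here",
]
results = [last_field(line) for line in lines]
print(results)
```

A None in the result is a hint to inspect that line, for example with repr(line), to see whether it really contains '\t'.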

Related

Add filenames to multiple for loops

I have a list of file names, like this.
file_names = ['file1', 'file2']
I also have lists of keywords that I am trying to extract from some files. The keyword lists (list_1, list_2) and the text strings that come from file1 and file2 are below:
## list_1 keywords
list_1 = ['hi', 'hello']
## list_2 keywords
list_2 = ['I', 'am']
## Text strings from file_1 and file_2
big_list = ['hi I am so and so how are you', 'hello hope all goes well by the way I can help you']
My function to extract text:
import re
def my_func(text_string, key_words):
    sentences = re.findall(r"([^.]*\.)", text_string)
    for sentence in sentences:
        if all(word in sentence for word in key_words):
            return sentence
Now, I am going through multiple lists with two nested for loops (as shown below) and calling the function. After each iteration of these loops, I want to save the output under the corresponding name from the file_names list.
for a,b in zip(list_1,list_2):
    for item in big_list:
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        ## Here I would like to add the file name i.e (print(filename))
        print(sentence_1)
        print(sentence_2)
I need an output that looks like this,
file1 is:
None
file2 is:
None
You can ignore None in my output for now, as my main focus is to iterate through the filename list and add the names to my output. I would appreciate any help to achieve this.
You can get the index inside a Python for loop with enumerate and use it to look up the file each string came from. With this you can print the current file name.
Here is an example of how you can do it:
for a,b in zip(list_1,list_2):
    # idx is the index here
    for idx, item in enumerate(big_list):
        sentence_1 = my_func(item, a.split(' '))  # my_func is the function from the question
        sentence_2 = my_func(item, b.split(' '))
        prefix = file_names[idx] + " is: " # Use idx to get the file from the file list
        if sentence_1 is not None:
            print(prefix + sentence_1)
        if sentence_2 is not None:
            print(prefix + sentence_2)
Update:
If you want to print the results after the iteration, you can temporarily save them in a dictionary and then loop through it:
for a,b in zip(list_1,list_2):
    resMap = {}
    # idx is the index here
    for idx, item in enumerate(big_list):
        sentence_1 = my_func(item, a.split(' '))  # my_func is the function from the question
        sentence_2 = my_func(item, b.split(' '))
        if sentence_1 is not None:
            resMap[file_names[idx]] = sentence_1
        if sentence_2 is not None:
            resMap[file_names[idx]] = sentence_2
    for k in resMap.keys():
        prefix = k + " is: " # k is the file name used as the dictionary key
        print (prefix + resMap[k])
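Since file_names and big_list run in parallel, zip can pair each name with its text directly. The sketch below is an illustration under assumptions: find_sentence is a hypothetical stand-in for the question's function, and the sample texts are made up. Note that the regex only matches text ending in a period, so the question's strings, which contain no period, always yield None.

```python
import re

def find_sentence(text, key_words):
    """Return the first sentence (ending in '.') containing every key word, else None."""
    for sentence in re.findall(r"[^.]*\.", text):
        if all(word in sentence for word in key_words):
            return sentence.strip()
    return None

file_names = ["file1", "file2"]
# Sample texts are made up for illustration; note the trailing periods.
texts = ["hi I am here.", "hello there. I am fine."]
key_words = ["I", "am"]

report = []
for name, text in zip(file_names, texts):
    # Pair each file name with the text that came from that file
    report.append(f"{name} is: {find_sentence(text, key_words)}")

print("\n".join(report))
```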

Never resets list

I am trying to create a calorie counter. The standard input goes like this:
python3 calories.txt < test.txt
Inside the calories file, each food is in the following format: apples 500
The problem I am having is that whenever I calculate the values for a person, the list never seems to go back to being empty.
import sys
food = {}
eaten = {}
finished = {}
total = 0
#mappings
def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1]
            food[key] = value
def calculate(x):
    a = []
    for keys,values in x.items():
        for c in values:
            try:
                a.append(int(food[c]))
            except:
                a.append(100)
        print("before",a)
        a = []
        total = sum(a) # Problem here
        print("after",a)
        print(total)
def main():
    calories(sys.argv[1])
    for line in sys.stdin:
        lines = line.strip().split(',')
        for c in lines:
            values = lines[0]
            keys = lines[1:]
            eaten[values] = keys
        calculate(eaten)
if __name__ == '__main__':
    main()
Edit - forgot to include what test.txt would look like:
joe,almonds,almonds,blue cheese,cabbage,mayonnaise,cherry pie,cola
mary,apple pie,avocado,broccoli,butter,danish pastry,lettuce,apple
sandy,zuchini,yogurt,veal,tuna,taco,pumpkin pie,macadamia nuts,brazil nuts
trudy,waffles,waffles,waffles,chicken noodle soup,chocolate chip cookie
How to make it easier on yourself:
When reading the calorie data, convert the calories to int() as soon as possible; there is no need to do it every time you want to sum something up.
A dictionary has a .get(key, defaultvalue) accessor, so "if the food is not found, use 100 as the default" becomes a one-liner without try: ... except:
This works for me. It does not use sys.stdin; instead it takes the second file as a file argument as well, rather than piping it into the program using <.
I modified some of the parsing to remove whitespace and to return a [(name,cal),...] list of tuples from calculate.
May it help you to fix it to your liking:
def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1].strip() # ensure no surrounding whitespace
            food[key] = int(value)
def getCal(foodlist, defValueUnknown = 100):
    """Get sum / total calories of a list of ingredients, unknown cost 100."""
    return sum( food.get(x,defValueUnknown ) for x in foodlist) # calculate it, if unknown assume 100
def calculate(x):
    a = []
    for name,foods in x.items():
        a.append((name, getCal(foods))) # append as tuple to list for all names/foods eaten
    return a
def main():
    calories(sys.argv[1])
    with open(sys.argv[2]) as f: # parse as file, not piped in via sys.stdin
        for line in f:
            lines = line.strip().split(',')
            for c in lines:
                values = lines[0].strip()
                keys = [x.strip() for x in lines[1:]] # ensure no surrounding whitespace
                eaten[values] = keys
    calced = calculate(eaten) # calculate after all are read into the dict
    print (calced)
Output:
[('joe', 1400), ('mary', 1400), ('sandy', 1600), ('trudy', 1000)]
Using sys.stdin and piping just led to my console blinking and waiting for manual input, maybe VS related...
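The core idea, a per-person total built with dict.get and a default, can be tried in isolation. This is a minimal sketch with a made-up calorie table and made-up names (total_calories, the food values and the eaten entries are all assumptions, not data from the question). Because each total is computed from a fresh generator expression, nothing carries over from one person to the next.

```python
# Hypothetical calorie table; the real values come from the calories file.
food = {"almonds": 200, "cola": 150, "apple": 50}

def total_calories(items, default=100):
    """Sum calories for items, counting unknown foods as `default`."""
    return sum(food.get(item, default) for item in items)

# Hypothetical input: person -> list of foods eaten
eaten = {"joe": ["almonds", "almonds", "cola"], "mary": ["apple", "mystery stew"]}

# One independent total per person, no shared list to reset
totals = {name: total_calories(items) for name, items in eaten.items()}
print(totals)
```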

How to calculate from a dictionary in python

import operator
with open("D://program.txt") as f:
    Results = {}
    for line in f:
        part_one,part_two = line.split()
        Results[part_one] = part_two
c=sum(int(Results[x]) for x in Results)
r=c/12
d=len(Results)
F=max(Results.items(), key=operator.itemgetter(1))[0]
u=min(Results.items(), key=operator.itemgetter(1))[0]
print ("Number of entries are",d)
print ("Student with HIGHEST mark is",F)
print ("Student with LOWEST mark is",u)
print ("Avarage mark is",r)
Results = [ (v,k) for k,v in Results.items() ]
Results.sort(reverse=True)
for v,k in Results:
print(k,v)
import sys
orig_stdout = sys.stdout
f = open('D://programssr.txt', 'w')
sys.stdout = f
print ('Number of entries are',d)
print ("Student with HIGHEST mark is",F)
print ("Student with LOWEST mark is",u)
print ("Avarage mark is",r)
for v,k in Results:
print(k,v)
sys.stdout = orig_stdout
f.close()
I want to read a txt file, but the problem is that the script can't compute the results I want to write to a new file because of the NAMES and MARKS header line in the file. If you remove that line it works fine. I want to make the calculations without removing NAMES and MARKS from the txt file. What am I doing wrong?
NAMES MARKS
Lux 95
Veron 70
Lesley 88
Sticks 80
Tipsey 40
Joe 62
Goms 18
Wesley 35
Villa 11
Dentist 72
Onty 50
Just consume the first line with the next() function before looping over the file:
with open("D://program.txt") as f:
    Results = {}
    next(f)
    for line in f:
        part_one,part_two = line.split()
        Results[part_one] = part_two
Note that file objects are iterator-like objects (one-shot iterables): when you loop over them you consume the items and no longer have access to them.
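The header-skipping step can be tried without the real file. The sketch below uses io.StringIO as a stand-in for the file and a shortened, partly made-up data set. It also stores the marks as int, which is an assumption about the intent: the question's code keeps them as strings, so max and min would compare them as text rather than as numbers. Dividing by len(results) instead of a fixed number is likewise an assumption.

```python
import io

# Stand-in for the real file; the first line is the header to skip.
sample = io.StringIO("NAMES MARKS\nLux 95\nVeron 70\nGoms 18\n")

results = {}
next(sample)  # consume the header line before the loop
for line in sample:
    name, mark = line.split()
    results[name] = int(mark)  # store marks as int so comparisons are numeric

highest = max(results, key=results.get)
lowest = min(results, key=results.get)
average = sum(results.values()) / len(results)
print(highest, lowest, average)
```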

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function to count the number of each sequence ID that occurs in this file (it's a sample blast file)
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which when called just returns empty braces {}.
So I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
    dic1={}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line.startswith('#'):
                line = line.split()
                seq_id = line[1]
                if seq_id not in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1
print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict
d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
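collections.Counter offers the same counting in one step. The following is a sketch on made-up tab-separated lines, since the actual input file is not shown here; it assumes, as the answers above do, that the sequence ID is in the second column and that comment lines start with '#'.

```python
from collections import Counter
import io

# Made-up tab-separated lines standing in for the BLAST output file.
sample = io.StringIO(
    "# BLASTN comment line\n"
    "query1\tFO203510.1\t99.0\n"
    "query1\tFO203510.1\t98.5\n"
    "query1\tAB000001.1\t97.0\n"
)

# Count the second column of every non-empty, non-comment line
counts = Counter(
    line.split("\t")[1]
    for line in sample
    if line.strip() and not line.startswith("#")
)
print(counts)
```

counts.most_common() then lists the IDs from most to least frequent.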

How can I simplify and format this function?

So I have this messy code where I want to get every word from frankenstein.txt, sort the words alphabetically, eliminate one- and two-letter words, and write them into a new file.
def Dictionary():
    d = []
    count = 0
    bad_char = '~!##$%^&*()_+{}|:"<>?\`1234567890-=[]\;\',./ '
    replace = ' '*len(bad_char)
    table = str.maketrans(bad_char, replace)
    infile = open('frankenstein.txt', 'r')
    for line in infile:
        line = line.translate(table)
        for word in line.split():
            if len(word) > 2:
                d.append(word)
                count += 1
    infile.close()
    file = open('dictionary.txt', 'w')
    file.write(str(set(d)))
    file.close()
Dictionary()
How can I simplify it and make it more readable? Also, how can I make the words write vertically in the new file (it currently writes a horizontal list), like this:
abbey
abhorred
about
etc....
A few improvements below:
from string import digits, punctuation
def create_dictionary():
    words = set()
    bad_char = digits + punctuation + '...' # may need more characters
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)
    with open('frankenstein.txt') as infile:
        for line in infile:
            line = line.strip().translate(table)
            for word in line.split():
                if len(word) > 2:
                    words.add(word)
    with open('dictionary.txt', 'w') as outfile:
        # writelines does not add newlines itself, so append one per word
        outfile.writelines(word + '\n' for word in sorted(words))
A few notes:
follow the style guide;
the string module contains constants you can use to provide the "bad characters";
you never used count (which was just len(d) anyway);
use the with context manager for file handling; and
using a set from the start prevents duplicates, but they aren't ordered (hence sorted).
Using the re module:
import re
words = set()
with open('frankenstein.txt') as infile:
    for line in infile:
        # update() adds every item to the set (sets have no extend())
        words.update(x for x in re.split(r'[^A-Za-z]+', line) if len(x) > 2)
with open('dictionary.txt', 'w') as outfile:
    # writelines does not add newlines itself, so append one per word
    outfile.writelines(word + '\n' for word in sorted(words))
In the r'[^A-Za-z]+' pattern passed to re.split, replace 'A-Za-z' with the characters you want to keep in dictionary.txt.
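The regex approach can be checked on a short string before running it on the whole book. This sketch uses re.findall to collect alphabetic runs directly instead of splitting on everything else; the sample text stands in for the file contents, and lower-casing the words is an assumption so that capitalised and lower-case forms are treated as the same word.

```python
import re

# Sample text standing in for the contents of frankenstein.txt.
text = "It was on a dreary night of November, that I beheld the accomplishment of my toils."

# Keep alphabetic runs longer than two letters; lower-casing is an assumption
# so that "The" and "the" would count as one word.
words = sorted({w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) > 2})

# One word per line, ready to be written to the output file
output = "\n".join(words)
print(output)
```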
