Calculate mean only for match strings - python-3.x

I have an assignment in which I have a file that contains a lot of chromosomes, and I need to calculate the mutation level for each one.
The problem is that each chromosome can appear several times, and I need to find the mean of all the mutation levels for that chromosome. On top of that, I only need mutations between the same nucleotides (T-->C or G-->A).
The mutation level is calculated from DP4 under INFO, which contains four numbers represented as [ref+,ref-,alt+,alt-].
Example of the file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Aligned.sortedByCoord.out.bam
chr1 143755378 . T C 62 . DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20 GT:PL:DP 1/1:89,179,0:548
chr3 57644487 . T C 16.4448 . DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20 GT:PL:DP 0/1:49,0,63:300
chr3 80706912 . T C 212 . DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20 GT:PL:DP 1/1:239,255,0:298
So this is what I did until now; I'm kind of stuck and not really sure how to continue from this point:
def vcf(file):
    with open(file, "r+") as my_file:
        """First I wanted to clear the headline"""
        for columns in my_file:
            if columns.startswith("#"):
                continue
        """Then I split the file into columns"""
        for columns in my_file:
            columns = columns.rstrip('\n').split('\t')
            """This is the info column"""
            for row in columns[7]:
                row = columns[7].split(";")
                """Using slicing I extracted the DP4 part and removed the str DP4"""
                DP4 = [row[-2]]
                new_DP4 = [x.replace("DP4=", "") for x in DP4]
                """Then I took all the int outs and put them under the categories"""
                for x in new_DP4:
                    xyz = x.split(",")
                    ref_plus = int(xyz[0])
                    ref_minus = int(xyz[1])
                    alt_plus = int(xyz[2])
                    alt_minus = int(xyz[3])
                    """calculated the mean for each one"""
                    formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
                    """made a list of the chromosomes and their means"""
                    chr_form = [columns[0], columns[3], columns[4], formula]
So basically I thought that now that I have all the data in a list, I could somehow group the entries for the same chromosome and compute the means, but I can't figure out how to do it. I tried to use regex as well, but I'm not that familiar with it.
this is my current output for chr_form:
['chr3', 'T', 'C', 0.44666666666666666]
['chr3', 'T', 'C', 0.9932885906040269]
['chr5', 'A', 'G', 0.42073170731707316]
['chr5', 'A', 'G', 0.5772870662460567]
['chr6', 'A', 'G', 0.5153061224489796]
['chr6', 'A', 'G', 0.8934010152284264]
and so on..
but the output I want to get in the end is this:
{1: {'T->C': 0.802}, 3: {'T->C': 0.446}}
I'll be happy to get an idea or an example of how to calculate the mean for each chromosome.

You have lots of unnecessary for loops. The only loop you need is for the lines in the file, you don't need to loop over the characters in fields when you're splitting them or removing something from the whole field.
At the end, you should be adding the result of the calculation to a dictionary.
def vcf(file):
    chromosomes = {}
    with open(file, "r+") as my_file:
        # First I wanted to clear the headline
        for line in my_file:
            if line.startswith("#"):  # skip comment lines
                continue
            line = line.rstrip('\n').split('\t')
            # This is the info column
            info = line[7].split(";")
            # Using slicing I extracted the DP4 part and removed the str DP4
            DP4 = info[-2].replace("DP4=", "")
            # Then I took all the int outs and put them under the categories
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(','))
            # calculated the mean for each one
            formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
            # Get chromosome number from first field
            chr_num = int(line[0].replace('chr', ''))
            chromosomes[chr_num] = {f'{line[3]}->{line[4]}': formula}
    return chromosomes
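Note that the dictionary assignment above overwrites earlier entries for the same chromosome rather than averaging them, and it does not filter for the T-->C / G-->A mutations the assignment asks for. One way to sketch both steps, collecting a running sum and count per chromosome and dividing at the end (the function and variable names here are my own):

```python
from collections import defaultdict

def mean_mutation_levels(lines):
    """Mean DP4-based mutation level per chromosome, T->C and G->A only."""
    sums = defaultdict(float)   # (chr_num, 'REF->ALT') -> sum of levels
    counts = defaultdict(int)   # (chr_num, 'REF->ALT') -> number of records
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip('\n').split('\t')
        ref, alt = fields[3], fields[4]
        if (ref, alt) not in {('T', 'C'), ('G', 'A')}:
            continue  # keep only the requested mutation types
        # Find the DP4 entry by name instead of relying on its position
        dp4 = next(f for f in fields[7].split(';') if f.startswith('DP4='))
        ref_p, ref_m, alt_p, alt_m = map(int, dp4[4:].split(','))
        level = (alt_p + alt_m) / (alt_p + alt_m + ref_p + ref_m)
        key = (int(fields[0].replace('chr', '')), f'{ref}->{alt}')
        sums[key] += level
        counts[key] += 1
    result = {}
    for (chr_num, mut), total in sums.items():
        result.setdefault(chr_num, {})[mut] = total / counts[(chr_num, mut)]
    return result
```

Called with the open file object (or any iterable of lines), this returns the nested shape from the question, e.g. {1: {'T->C': 0.802}, 3: {'T->C': 0.719}} for the sample rows.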

Related

How to get the count of a word in a database using pandas

For the same database, the following 2 pieces of code show different answers when executed. According to the answer given, the 2nd one is correct, but what is the mistake in the 1st code?
code 1
import pandas as pd

df = pd.read_csv("amazon_baby.csv", index_col="name")
sw = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
for i in sw:
    df[i] = df["review"].str.count(i)
    y = df[i].sum(axis=0)
    print(i, y)
code 2
import pandas as pd
from collections import Counter

df = pd.read_csv("amazon_baby.csv", index_col="name")
sw = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
df['word_count'] = df['review'].apply(lambda x: Counter(str(x).split()))
def great_count(x):
    if 'great' in x:
        return x.get('great')
    else:
        return 0
df['great3'] = df['word_count'].apply(great_count)
print(sum(df['great3']))
These pieces of code are quite different.
The first one takes each of the words in the sw list and counts the number of occurrences as a substring. This means that in the string "this is great, this is the greatest", the word "great" is counted 2 times. This is the error, I suppose.
Second code splits the text in separate words: ['this', 'is', 'great,', 'this', 'is', 'the', 'greatest'], then calculates counts: Counter({'this': 2, 'is': 2, 'great,': 1, 'the': 1, 'greatest': 1}) and shows the sum of the column.
But!! There is no word "great" in the Counter - this is because of the comma. So this is also wrong.
A better way would be to get rid of punctuation first. For example like this (t is the review text, and the string module must be imported):
import string
sum(1 for i in ''.join(ch for ch in t if ch not in string.punctuation).split() if i == 'great')
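As a self-contained sketch of that idea (plain strings, no pandas; the function name and sample sentence are my own):

```python
import string

def count_word(text, word):
    """Count exact occurrences of `word`, ignoring punctuation."""
    cleaned = ''.join(ch for ch in str(text) if ch not in string.punctuation)
    return sum(1 for token in cleaned.split() if token == word)

print(count_word("this is great, this is the greatest", "great"))  # → 1
```

Applied to the DataFrame, this could be something like df['review'].apply(lambda x: count_word(x, 'great')).sum(), which counts "great" but not "greatest" or "great,".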

Question on calculating incoming data from file

I am reading a data file and need to calculate totals for the different items by adding up their values from different lines. For example:
Fruit,Number
banana,25
apple,12
kiwi,29
apple,44
apple,81
kiwi,3
banana,109
kiwi,113
kiwi,68
We would need to add a third column that is a running total per fruit, and a fourth that is a running total of all fruits.
So the output should be like following:
Fruit,Number,TotalFruit,TotalAllFruits
banana,25,25,25
apple,12,12,37
kiwi,29,29,66
apple,44,56,110
apple,81,137,191
kiwi,3,32,194
banana,109,134,303
kiwi,113,145,416
kiwi,68,213,484
I was able to get the first 2 columns printed, but I'm having trouble with the last 2 columns.
import sys
import re

f1 = open("SampleInput.csv", "r")
f2 = open('SampleOutput.csv', 'a')
sys.stdout = f2
print("Fruit,Number,TotalFruit,TotalAllFruits")
for line1 in f1:
    fruit_list = line1.split(',')
    exec("%s = %d" % (fruit_list[1], 0))
    print(fruit_list[0] + ',' + fruit_list[1])
I am just learning python, so I want to apologize in advance if I am missing something very simple.
You need to keep the values read from the input file as you go.
During the loop, you read the value from the current line, add it to the totals accumulated from the previous lines, and compute the two new columns for the current line.
Then print the rows after all input lines are read.
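A plain-Python sketch of that advice, using a dict of per-fruit running totals and the csv module instead of exec (the function name add_running_totals is my own):

```python
import csv

def add_running_totals(in_path, out_path):
    """Append per-fruit and overall running-total columns to a Fruit,Number CSV."""
    per_fruit = {}    # fruit -> running total for that fruit
    grand_total = 0   # running total over all fruits
    with open(in_path, newline='') as f_in, open(out_path, 'w', newline='') as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        header = next(reader)  # consume "Fruit,Number"
        writer.writerow(header + ['TotalFruit', 'TotalAllFruits'])
        for fruit, number in reader:
            n = int(number)
            per_fruit[fruit] = per_fruit.get(fruit, 0) + n
            grand_total += n
            writer.writerow([fruit, n, per_fruit[fruit], grand_total])
```

On the sample input this writes exactly the requested output, e.g. the last row becomes kiwi,68,213,484.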
I would recommend using the pandas library, as it makes this kind of running total much easier. No explicit loop is needed: a cumulative sum per fruit plus a cumulative sum over all rows produces exactly the two requested columns.
import pandas as pd

df = pd.read_csv("SampleInput.csv", sep=",")
# Running total within each fruit
df['TotalFruit'] = df.groupby('Fruit')['Number'].cumsum()
# Running total over all rows
df['TotalAllFruits'] = df['Number'].cumsum()
df.to_csv('SampleOutput.csv', sep=",", index=False)
Feel free to change the number of columns to your needs and add your custom logic.

Arranging in ascending order in text file

So I have a text file which looks like this:
07,12,9201
07,12,9201
06,18,9209
06,18,9209
06,19,9209
06,19,9209
07,11,9201
I first want to remove all duplicate lines, then sort column 1 in ascending order, and then sort column 2 in ascending order while column 1 stays in ascending order.
output:
06,18,9209
06,19,9209
07,11,9201
07,12,9201
I have tried this so far:
with open('abc.txt') as f:
    lines = [line.split(' ') for line in f]
Consider another example:
00,0,6098
00,1,6098
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
00,2,6098
00,20,6102
00,21,6087
00,22,6087
00,23,6087
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
The output for this file should be:
00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
00,20,6102
00,21,6087
00,22,6087
00,23,6087
You can do something like below.
from itertools import groupby, chain
from collections import OrderedDict

input_file = 'input_file.txt'

# Collecting lines
lines = [tuple(line.strip().split(',')) for line in open(input_file)]

# Removing dups and sorting by first column
sorted_lines = sorted(set(lines), key=lambda x: int(x[0]))

# Grouping and ordering by second column
result = OrderedDict()
for k, g in groupby(sorted_lines, key=lambda x: x[0]):
    result[k] = sorted(g, key=lambda x: int(x[1]))

for v in chain(*result.values()):
    print(','.join(v))
Output 1:
06,18,9209
06,19,9209
07,11,9201
07,12,9201
Output 2:
00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
00,20,6102
00,21,6087
00,22,6087
00,23,6087
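A shorter alternative, assuming every line has the same three comma-separated fields: Python tuples compare element-wise, so a single sorted() call with a numeric tuple key handles both columns at once (the function name dedupe_and_sort is my own):

```python
def dedupe_and_sort(lines):
    """Drop duplicate lines, then sort by column 1 and column 2 numerically."""
    unique = set(line.strip() for line in lines if line.strip())
    # key (int(col1), int(col2)): col2 only breaks ties within equal col1
    return sorted(unique, key=lambda s: (int(s.split(',')[0]), int(s.split(',')[1])))

print(dedupe_and_sort(["07,12,9201", "07,12,9201", "06,18,9209",
                       "06,18,9209", "06,19,9209", "06,19,9209",
                       "07,11,9201"]))
# → ['06,18,9209', '06,19,9209', '07,11,9201', '07,12,9201']
```

Converting to int matters for the second example: a plain string sort would put "20" before "3".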

Why does this iteration over a list of lists not work?

I am trying to look for keywords in sentences stored as a list of lists. The outer list contains sentences, and each inner list contains the words of a sentence. I want to iterate over each word in each sentence to look for the defined keywords and return the values where they are found.
This is what my token_sentences looks like.
I took help from this post: How to iterate through a list of lists in python? However, I am getting an empty list in return.
This is the code I have written.
import nltk
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
text = "MDCT SCAN OF THE CHEST: HISTORY: Follow-up LUL nodule. TECHNIQUES: Non-enhanced and contrast-enhanced MDCT scans were performed with a slice thickness of 2 mm. COMPARISON: Chest CT dated on 01/05/2018, 05/02/207, 28/09/2016, 25/02/2016, and 21/11/2015. FINDINGS: Lung parenchyma: There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015). Also further increased size of two ground-glass nodules at apicoposterior segment of the LUL (SE 3; IM 37), and superior segment of the LLL (SE 3; IM 58), now measuring about 1 cm (previously size 0.4 cm in 2015), and 1.1 cm (previously size 0.7 cm in 2015) in greatest transaxial dimension, respectively."
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in nltk.sent_tokenize(text)]
nodule_keywords = ["nodules","nodule"]
count_nodule =[]
def GetNodule(sentence, keyword_list):
    s1 = sentence.split(' ')
    return [i for i in s1 if i in keyword_list]

for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
    count_nodule.append(result_calcified_nod)
However, I am getting an empty list as a result in the variable count_nodule.
This is the value of first two rows of "token_sentences".
token_sentences = [['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':', 'Follow-up', 'LUL', 'nodule', '.'],['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT', 'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of', '2', 'mm', '.']]
Please help me to figure out where I am doing wrong!
You need to remove s1 = sentence.split(' ') from GetNodule because sentence has already been tokenized (it is already a List).
Remove the [0] from GetNodule(sub_list[0], nodule_keywords). Not sure why you would want to pass the first word of each sentence into GetNodule!
The error is here:
for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
You are looping over each sub_list in tokens_sentences, but only passing the first word sub_list[0] to GetNodule.
This type of error is fairly common, and somewhat hard to catch, because Python code which expects a list of strings will happily accept and iterate over the individual characters in a single string instead if you call it incorrectly. If you want to be defensive, maybe it would be a good idea to add something like
assert not all(len(x)==1 for x in sentence)
And of course, as @dyz notes in their answer, if you expect sentence to already be a list of words, there is no need to split anything inside the function. Just loop over the sentence:
return [w for w in sentence if w in keyword_list]
As an aside, you probably want to extend the final result with the list result_calcified_nod rather than append it.
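Putting both fixes together, the corrected loop might look like this (a standalone sketch without NLTK, using the sample tokens_sentences value from the question):

```python
def get_nodule(sentence, keyword_list):
    # sentence is already a list of tokens; no split needed
    return [w for w in sentence if w in keyword_list]

tokens_sentences = [
    ['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':',
     'Follow-up', 'LUL', 'nodule', '.'],
    ['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT',
     'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness',
     'of', '2', 'mm', '.'],
]
nodule_keywords = ["nodules", "nodule"]

count_nodule = []
for sub_list in tokens_sentences:
    # pass the whole sentence, and extend with the matches found in it
    count_nodule.extend(get_nodule(sub_list, nodule_keywords))

print(count_nodule)  # → ['nodule']
```

Using extend rather than append gives one flat list of matched keywords instead of one sublist per sentence.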

iterating through values in a dictionary to invert key and values

I am trying to invert an Italian-English dictionary using the code that follows.
Some terms have one translation, while others have several. If an entry has multiple translations, I iterate through each word, adding it to the English-Italian dict (if not already present).
If there is a single translation it should not iterate, but as I have written the code, it does. Also, only the last translation of a term with multiple translations is added to the dictionary. I cannot figure out how to rewrite the code to resolve what should be a really simple task.
from collections import defaultdict

def invertdict():
    source_dict = {'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'], 'vento (s.m.)': 'wind'}
    english_dict = defaultdict(list)
    for parola, words in source_dict.items():
        if len(words) > 1:  # more than one translation?
            for word in words:  # if true, iterate through each word
                word = str(word).strip(' ')
                print(word)
        else:  # only one translation, don't iterate!!
            word = str(words).strip(' ')
            print(word)
        if word in english_dict.keys():  # check to see if the term already exists
            if english_dict[word] != parola:  # check that the italian is not present
                #english_dict[word] = [english_dict[word], parola]
                english_dict[word].append(parola).strip('')
        else:
            english_dict[word] = parola.strip(' ')
    print(len(english_dict))
    for key, value in english_dict.items():
        print(key, value)
When this code is run, I get :
hog
keelson
inner keel
w
i
n
d
2
inner keel paramezzale (s.m.)
d vento (s.m.)
instead of
hog: paramezzale, keelson: paramezzale, inner keel: paramezzale, wind: vento
It would be easier to use lists everywhere in the dictionary, like:
source_dict = {'many translations': ['a', 'b'], 'one translation': ['c']}
Then you need 2 nested loops. Right now you're not always running the inner loop.
for italian_word, english_words in source_dict.items():
    for english_word in english_words:
        # print, add to english dict, etc.
If you can't change the source_dict format, you need to check the type explicitly. I would transform the single item into a list:
for italian_word, item in source_dict.items():
    if not isinstance(item, list):
        item = [item]
Full code:
from collections import defaultdict

source_dict = {'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'], 'vento (s.m.)': ['wind']}
english_dict = defaultdict(list)
for parola, words in source_dict.items():
    for word in words:
        word = str(word).strip(' ')
        # add to the list if not already present
        # english_dict is a defaultdict(list) so we can use .append directly
        if parola not in english_dict[word]:
            english_dict[word].append(parola)
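Running that loop on the sample dictionary shows the inversion the question asked for, with one list of Italian terms per English word:

```python
from collections import defaultdict

source_dict = {'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'],
               'vento (s.m.)': ['wind']}
english_dict = defaultdict(list)
for parola, words in source_dict.items():
    for word in words:
        english_dict[str(word).strip()].append(parola)

print(dict(english_dict))
# → {'hog': ['paramezzale (s.m.)'], 'keelson': ['paramezzale (s.m.)'],
#    'inner keel': ['paramezzale (s.m.)'], 'wind': ['vento (s.m.)']}
```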
