Creating two columns from an unstructured file of IDs and sequences - python-3.x

Problem: Working with Python 3.x, I have a file called input.txt with the content below.
2345673 # First ID
0100121102020211111002 # first sequence (seq) which is long and goes to several lines
0120102100211001101200
6758442 #Second ID
0202111100011111022222 #second sequence (seq) which is long and goes to several lines
0202111110001120211210
0102101011211001101200
What I want: to process input.txt and save the results in output.csv, so that when I read it with pandas the
result is a data frame like the one below.
ID Seq
2345673 0 1 0 0 1 2 1 1 0 2 …
6758442 0 2 0 2 1 1 1 1 0 0 …
Below is my code:
with open("input.txt") as f:
    with open("out.csv", "w") as f1:
        for i, line in enumerate(f):  # read each line in file
            if len(line) < 15:  # check if the line length is, say, < 15
                id = line  # if yes, make the line the ID
            else:
                seq = line  # if not, make it a sequence
                # print(id)
                lines = []
                lines.append(','.join([str(id), str(seq)]))
                for l in lines:
                    f1.write('(' + l + '),\n')  # write to file f1
When I read out.csv in pandas, the output is not what I want (see below). I would appreciate your help; I am really stuck.
(2345673
,0100121102020211111002
),
(2345673
,0120102100211001101200
),
(6758442
,0202111100011111022222
),
(6758442
,0202111110001120211210
),
(6758442
,0102101011211001101200),

import pandas as pd

# Idea: build two lists, one with IDs and another with sequences.
with open("input.txt") as f:
    ids = []
    seqs = []
    seq = ""
    for line in f:
        if len(line) < 15:
            # Short line: a new ID starts, so store the previous sequence.
            seqs.append(seq)
            ids.append(line.rstrip('\n').rstrip(' '))
            seq = ""
        else:
            # Combine all sequence lines that belong to the same ID into one.
            seq += line.rstrip('\n').rstrip(' ')
    seqs.append(seq)  # store the sequence of the last ID
seqs = seqs[1:]  # drop the empty sequence stored before the first ID
df = pd.DataFrame(list(zip(ids, seqs)), columns=['id', 'seq'])
df.to_csv("out.csv", index=False)
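The desired output in the question shows the digits of each sequence separated by spaces, which the code above does not produce. A minimal sketch of that variant, using only the standard library, could look like the following. The function name `parse_records`, the stripping of trailing `#` comments, and the `id_max_len` threshold are assumptions made for illustration; an in-memory `io.StringIO` stands in for input.txt and out.csv.

```python
import csv
import io


def parse_records(lines, id_max_len=15):
    """Group lines into (id, sequence) pairs.

    Assumption: a line shorter than id_max_len characters is an ID and
    every other non-empty line continues the current sequence.
    """
    records = []
    current_id, parts = None, []
    for raw in lines:
        # Drop an optional trailing "# comment" and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if len(line) < id_max_len:
            if current_id is not None:
                records.append((current_id, "".join(parts)))
            current_id, parts = line, []
        else:
            parts.append(line)
    if current_id is not None:
        records.append((current_id, "".join(parts)))
    return records


sample = io.StringIO(
    "2345673\n"
    "0100121102020211111002\n"
    "0120102100211001101200\n"
    "6758442\n"
    "0202111100011111022222\n"
)
records = parse_records(sample)

# Write one row per ID, with the digits of the sequence separated by spaces.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["ID", "Seq"])
for rec_id, seq in records:
    writer.writerow([rec_id, " ".join(seq)])
print(out.getvalue())
```

Replacing the two `io.StringIO` objects with real file handles (opened with `newline=""` for the CSV file) should give a file that pandas can read back with `pd.read_csv`.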

Related

How to extract many groups of cells separated by a specified number of rows in excel using python and write it to an other file?

I have a CSV file with around 58 million cells of numerical data. I want to extract blocks of 16 cells (4 rows x 4 columns) that start 49 rows apart.
Let me describe it clearly.
The data I need to extract
The image above shows the first set of data to be extracted (rows 23 to 26, columns 92 to 95). This data has to be written to another CSV file (preferably as a single row).
Then I move down 49 rows (to row 72) and extract the next 4 rows x 4 columns, as shown in the image below.
Next set of data
Similarly, I need to keep going until I reach the end of the file.
Third set
The next set is shown in the image above.
I have to keep going until I reach the end of the file, extracting thousands of such blocks.
I wrote some code for this, but it is not working and I don't know where the mistake is. I attach it here.
import pandas as pd
import numpy

df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))
arrY = []
ex = 0
for i in range(len(df)):
    if i == 0:
        for j in range(4):
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    else:
        for j in range(4):
            if j+22+i*(49) >= len(df):
                ex = 1
                break
            # print(j)
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    if ex == 1:
        break
# print(arrY)
a = []
for i in range(len(arrY) - 3):
    p = arrY[i]+arrY[i+1]+arrY[i+2]+arrY[i+3]
    a.append(p)
print(numpy.shape(a))
numpy.savetxt('myfile.csv', a, delimiter=',')
Using the above code, I didn't get the result I wanted.
Please help with this and correct where I have gone wrong.
I couldn't attach my CSV file here; please use any sample sheet you have, or create a simple one.
Thanks in advance! Have a great day.
I don't know exactly what your code is doing,
but I wrote my own:
import csv
from itertools import chain

CSV_PATH = 'TS_trace31.csv'
new_data = []
# newline='' is what the csv module recommends for files it reads or writes
with open(CSV_PATH, 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # row_num stores the big jumps, e.g. 23, 72, 121 ...
    row_num = 23
    # n stores the position within the group (0 - 3);
    # with n we can find rows 23, 24, 25, 26
    n = 0
    # row_group stores the 4 rows of the current group
    row_group = []
    # loop over every row in the main file
    for row in reader:
        if reader.line_num == row_num + n:
            # the first time this is 23 + 0;
            # then we add one to n,
            # so the next match is 24, and so on
            n += 1
            print(reader.line_num)
            # add each row to its group
            row_group.append(row[91:95])
            # check if we are at the end of the group, e.g. 26
            if n == 4:
                # reset the position within the group
                n = 0
                # add the jump to the main row number
                row_num += 49
                # combine the whole row_group into a single row
                new_data.append(list(chain(*row_group)))
                # clear row_group for the next set of rows
                row_group.clear()
                print('='*50)
        else:
            continue
# and finally write all the rows to a new file
with open('myfile.csv', 'w', newline='') as new_csvfile:
    writer = csv.writer(new_csvfile)
    writer.writerows(new_data)
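The same row arithmetic can also be expressed without a running counter, using the remainder of the row offset. The sketch below is only an illustration under assumptions: `extract_blocks` and its parameters are hypothetical names, rows are numbered from 1, and the grid is synthetic data standing in for the real CSV.

```python
def extract_blocks(rows, first_row=23, height=4, stride=49,
                   col_start=91, col_stop=95):
    """Flatten every height x width block into one output row.

    rows is any iterable of lists (for example a csv.reader).
    Row numbers are 1-based; blocks start at first_row,
    first_row + stride, first_row + 2 * stride, ...
    """
    result, group = [], []
    for line_num, row in enumerate(rows, start=1):
        offset = line_num - first_row
        # A row belongs to a block if it falls within the first `height`
        # rows of a stride-long window that starts at first_row.
        if offset >= 0 and offset % stride < height:
            group.extend(row[col_start:col_stop])
            if offset % stride == height - 1:
                result.append(group)
                group = []
    # An incomplete block at the end of the input is discarded.
    return result


# Synthetic 130 x 100 grid where each cell records its own position.
grid = [[f"r{r}c{c}" for c in range(1, 101)] for r in range(1, 131)]
blocks = extract_blocks(grid)
print(len(blocks), blocks[0][:4], blocks[1][0])
```

Because the function only needs an iterable of rows, it can be fed a `csv.reader` directly, so the whole file never has to be held in memory.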

Using pd.read_table() multiple times on same open file

I have a data structure of the following form:
**********DATA:0************
name_A name_B
0.16561919 0.03640960
0.39564838 0.66708115
0.60828075 0.95785214
0.68716186 0.92803331
0.80615505 0.96219926
**********data:0************
**********DATA:1************
name_A name_B
0.32474381 0.82506909
0.30934914 0.60406956
0.99519513 0.23425607
0.72210821 0.61141751
0.47362605 0.09892009
**********data:1************
**********DATA:2************
name_A name_B
0.46561919 0.13640960
0.29564838 0.66708115
0.40828075 0.35785214
0.08716186 0.52803331
0.70615505 0.96219926
**********data:2************
I would like to read each block into a separate pandas DataFrame with appropriate header titles. When I use the simple function below, only a single data block is stored in the output list. However, when I comment out the data.append(pd.read_table(file, nrows=5)) line, the function prints all the individual headers. The pandas read_table call seems to break out of the loop.
import pandas as pd

def read_data(filename):
    data = []
    with open(filename) as file:
        for line in file:
            if "**********DATA:" in line:
                print(line)
                data.append(pd.read_table(file, nrows=5))
    return data

read_data("data_file.txt")
How should I change the function to read all blocks?
I suggest a slightly different approach, in which you avoid using read_table and put dataframes in a dict instead of a list, like this:
import pandas as pd

def read_data(filename):
    data = {}
    i = 0
    with open(filename) as file:
        for line in file:
            if "**********DATA:" in line:
                data[i] = []
                continue
            if "**********data:" in line:
                i += 1
                data[i] = []
                continue
            else:
                data[i].append(line.strip("\n").split(" "))
    return {
        f"data_{k}": pd.DataFrame(data=v[1:], columns=v[0])
        for k, v in data.items()
        if v
    }
And so, with the text file you gave as input:
dfs = read_data("data_file.txt")
print(dfs["data_0"])
# Output
       name_A      name_B
0  0.16561919  0.03640960
1  0.39564838  0.66708115
2  0.60828075  0.95785214
3  0.68716186  0.92803331
4  0.80615505  0.96219926
print(dfs["data_1"])
# Output
       name_A      name_B
0  0.32474381  0.82506909
1  0.30934914  0.60406956
2  0.99519513  0.23425607
3  0.72210821  0.61141751
4  0.47362605  0.09892009
print(dfs["data_2"])
# Output
       name_A      name_B
0  0.46561919  0.13640960
1  0.29564838  0.66708115
2  0.40828075  0.35785214
3  0.08716186  0.52803331
4  0.70615505  0.96219926
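If pandas should still do the parsing, one option is to cut the file into blocks first and hand each block to the parser separately, so that the parser never touches the shared file handle. The sketch below shows only the splitting step, using the standard library; `split_blocks` is a hypothetical helper, and the marker strings are assumed to look exactly like those in the question.

```python
import io


def split_blocks(lines, start="**********DATA:", end="**********data:"):
    """Return the text between each start/end marker pair as one string."""
    blocks, current = [], None
    for line in lines:
        if start in line:
            current = []                      # open a new block
        elif end in line:
            if current is not None:
                blocks.append("".join(current))
            current = None                    # close the block
        elif current is not None:
            current.append(line)
    return blocks


sample = io.StringIO(
    "**********DATA:0************\n"
    "name_A name_B\n"
    "0.16561919 0.03640960\n"
    "**********data:0************\n"
    "**********DATA:1************\n"
    "name_A name_B\n"
    "0.32474381 0.82506909\n"
    "**********data:1************\n"
)
blocks = split_blocks(sample)
print(len(blocks))
print(blocks[0])
```

Each string can then be wrapped in `io.StringIO` and passed to `pd.read_csv(..., sep=r"\s+")`, which should yield one DataFrame per block with numeric columns rather than strings.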

How to sum specific values from two different txt files in python

I have 2 txt files with names and scores. For example:
File 1            File 2            Desired Output
Name     Score    Name     Score    Name     Score
Michael  20       Michael  30       Michael  50
Adrian   40       Adrian   50       Adrian   90
Jane     60                         Jane     60
I want to sum the scores that belong to the same name and print them. I tried to pair names and scores in two dictionaries and then merge the dictionaries. However, the merge cannot keep the same name with two different scores, so I'm stuck here. I've written something like the following:
d1 = dict()
d2 = dict()
with open('data1.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d1[test[i]] = test[i + 1]
        i += 2
    del d1['Name']
with open('data2.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d2[test[i]] = test[i + 1]
        i += 2
    del d2['Name']
z = dict(d2.items() | d1.items())
Using a dictionary comprehension should get you what you are after. I have assumed the contents of the files are:
file1.txt:
Name Score
Michael 20
Adrian 40
Jane 60
file2.txt:
Name Score
Michael 30
Adrian 50
Then you can get a total as:
with open("file1.txt", "r") as file_in:
    next(file_in)  # skip header
    # row.strip() filters out blank lines, which would otherwise break dict()
    file1_data = dict(row.split() for row in file_in if row.strip())
with open("file2.txt", "r") as file_in:
    next(file_in)  # skip header
    file2_data = dict(row.split() for row in file_in if row.strip())
result = {
    key: int(file1_data.get(key, 0)) + int(file2_data.get(key, 0))
    for key
    in set(file1_data).union(file2_data)  # every name that occurs in either file
}
print(result)
This should give you a result like:
{'Michael': 50, 'Jane': 60, 'Adrian': 90}
Use defaultdict:
from collections import defaultdict

name_scores = defaultdict(int)
files = ('data1.txt', 'data2.txt')
for file in files:
    with open(file, 'r') as f:
        next(f)  # skip the header line
        for line in f:
            if line.strip():  # ignore blank lines
                name, score = line.split()
                name_scores[name] += int(score)
Edit: the loop skips the header line of each file and ignores blank lines. The gist is that defaultdict(int) starts every new name at 0, so scores can simply be added.
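Another option along the same lines is `collections.Counter`, which behaves like a dictionary whose missing keys count as zero. This is a sketch rather than part of either answer: `sum_scores` is a hypothetical helper, and in-memory `io.StringIO` objects stand in for the two text files.

```python
import io
from collections import Counter


def sum_scores(*files):
    """Add up the scores per name across several open text files.

    Assumes each file starts with a "Name Score" header line.
    """
    totals = Counter()
    for f in files:
        next(f, None)                 # skip the header line, if any
        for line in f:
            fields = line.split()
            if len(fields) != 2:      # ignore blank or malformed lines
                continue
            name, score = fields
            totals[name] += int(score)
    return totals


file1 = io.StringIO("Name Score\nMichael 20\nAdrian 40\nJane 60\n")
file2 = io.StringIO("Name Score\nMichael 30\nAdrian 50\n")
totals = sum_scores(file1, file2)
for name, score in totals.items():
    print(name, score)
```

With real files, the same function can be called on handles returned by `open()`; a name that appears in only one file simply keeps its single score.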

I want to read file and write another file. Basically, I want to do some arithmetic and write few other columns

I have a file like
2.0 4 3
0.5 5 4
-0.5 6 1
-2.0 7 7
.......
The actual file is pretty big.
I want to read it and add a couple of columns. The first added column is column(4) = column(2) * column(3), and the second is column(5) = column(2)/column(1) + column(4), so the result should be
2.0 4 3 12 14
0.5 5 4 20 30
-0.5 6 1 6 -6
-2.0 7 7 49 45.5
.....
which I want to write in a different file.
with open('test3.txt', encoding ='latin1') as rf:
    with open('test4.txt', 'w') as wf:
        for line in rf:
            float_list = [float(i) for i in line.split()]
            print(float_list)
But so far this is all I have. I am able to create the list, but I am not sure how to perform arithmetic on it and create the new columns. I think I am completely off here. I am just a beginner in Python. Any help will be greatly appreciated. Thanks!
I would reuse your formulae, but shift the indexes, since they start at 0 in Python.
I would extend the list of floats read from each line with the new computations and write the line back, space-separated (converting back to str in a list comprehension).
So the loop can be written as follows:
with open('test3.txt', encoding ='latin1') as rf:
    with open('test4.txt', 'w') as wf:
        for line in rf:
            column = [float(i) for i in line.split()]  # your code
            column.append(column[1] * column[2])  # add column
            column.append(column[1]/column[0] + column[3])  # add another column
            wf.write(" ".join([str(x) for x in column])+"\n")  # write joined strings, separated by spaces
Something like this (see the comments in the code):
with open('test3.txt', encoding ='latin1') as rf:
    with open('test4.txt', 'w') as wf:
        for line in rf:
            float_list = [float(i) for i in line.split()]
            # calculate two new columns
            float_list.append(float_list[1] * float_list[2])
            float_list.append(float_list[1]/float_list[0] + float_list[3])
            # convert all values to text
            text_list = [str(i) for i in float_list]
            # concatenate all elements and write the line
            wf.write(' '.join(text_list) + '\n')
Try the following:
map() converts each element of the list to float. Wrapping it in list() is needed in Python 3, where map() returns an iterator that has no append(). At the end, map() is used again to convert each float to str so the values can be joined.
with open('out.txt', 'w') as out:
    with open('input.txt', 'r') as f:
        for line in f:
            my_list = list(map(float, line.split()))
            my_list.append(my_list[1] * my_list[2])
            my_list.append(my_list[1] / my_list[0] + my_list[3])
            out.write(' '.join(map(str, my_list)) + '\n')
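Converting every value to float, as above, produces lines such as `2.0 4.0 3.0 12.0 14.0` rather than the `2.0 4 3 12 14` shown in the question. If the original columns should be kept verbatim, one possible approach is to append only the new values, formatted with the `g` presentation type. This is a sketch: `extend_line` is a hypothetical helper, and `g` keeps six significant digits by default, which is assumed to be enough here.

```python
def extend_line(line):
    """Append column 4 = c2 * c3 and column 5 = c2 / c1 + c4 to the line."""
    fields = line.split()
    c1, c2, c3 = (float(x) for x in fields[:3])
    c4 = c2 * c3
    c5 = c2 / c1 + c4          # raises ZeroDivisionError if column 1 is 0
    # Keep the original text of the first three columns unchanged.
    return " ".join(fields[:3] + [format(c4, "g"), format(c5, "g")])


for line in ["2.0 4 3", "0.5 5 4", "-0.5 6 1", "-2.0 7 7"]:
    print(extend_line(line))
```

Inside the file loop, `wf.write(extend_line(line) + "\n")` would replace the list manipulation; rows whose first column is zero need separate handling.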

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function that counts how many times each sequence ID occurs in this file (it's a sample BLAST file).
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which just returns empty braces {} when called.
So I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
    dic1 = {}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            # skip blank lines and comment lines
            if line and not line.startswith('#'):
                seq_id = line.split()[1]
                if seq_id not in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1

print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict

d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
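`collections.Counter` is a closely related alternative that does the counting in a single call. The sketch below is illustrative only: `count_seq_ids` is a hypothetical name, the sample lines are made up (only the ID `FO203510.1` comes from the question), and the sequence ID is assumed to be the second whitespace-separated field of each non-comment line.

```python
from collections import Counter


def count_seq_ids(lines):
    """Count the second field of every non-empty, non-comment line."""
    return Counter(
        line.split()[1]
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )


# Made-up sample in the style of tabular BLAST output.
sample = [
    "# 4 hits found\n",
    "query1\tFO203510.1\t98.5\n",
    "query1\tFO203510.1\t97.0\n",
    "query1\tAB000001.1\t90.2\n",
    "\n",
    "query1\tFO203510.1\t88.8\n",
]
counts = count_seq_ids(sample)
print(counts.most_common())
```

Since an open file is also an iterable of lines, `count_seq_ids(f)` works on a file handle as well, and `most_common()` lists the IDs from most to least frequent.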

Resources