How to sum specific values from two different txt files in python - python-3.x

I have 2 txt files with names and scores. For example:
File 1            File 2            Desired Output
Name     Score    Name     Score    Name     Score
Michael  20       Michael  30       Michael  50
Adrian   40       Adrian   50       Adrian   90
Jane     60                         Jane     60
I want to sum the scores that share a name and print the totals. I tried pairing names and scores in two separate dictionaries and then merging them, but the merge can't keep the same name with two different scores, so I'm stuck. This is what I've written so far:
d1 = dict()
d2 = dict()
with open('data1.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d1[test[i]] = test[i + 1]
        i += 2
    del d1['Name']
with open('data2.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d2[test[i]] = test[i + 1]
        i += 2
    del d2['Name']
z = dict(d2.items() | d1.items())

Using a dictionary comprehension should get you what you are after. I have assumed the contents of the files are:
file1.txt:
Name Score
Michael 20
Adrian 40
Jane 60
file2.txt:
Name Score
Michael 30
Adrian 50
Then you can get a total as:
with open("file1.txt", "r") as file_in:
    next(file_in)  # skip header
    # row.strip() filters out blank lines (a bare "\n" is still truthy)
    file1_data = dict(row.split() for row in file_in if row.strip())
with open("file2.txt", "r") as file_in:
    next(file_in)  # skip header
    file2_data = dict(row.split() for row in file_in if row.strip())
result = {
    key: int(file1_data.get(key, 0)) + int(file2_data.get(key, 0))
    for key
    in set(file1_data).union(file2_data)  # could also use file1_data.keys()
}
print(result)
This should give you a result like:
{'Michael': 50, 'Jane': 60, 'Adrian': 90}

Use a defaultdict:
from collections import defaultdict
name_scores = defaultdict(int)
files = ('data1.txt', 'data2.txt')
for file in files:
    with open(file, 'r') as f:
        next(f)  # skip the header line
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            name, score = line.split()
            name_scores[name] += int(score)
Edit: the loop skips the header line, and split() takes care of trailing whitespace.
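To check the accumulate-by-name idea against the sample data from the question, here is a minimal sketch. It assumes each source is an iterable of lines with one header row; the helper name sum_scores and the in-memory lists are made up for the illustration, and real code would pass open file objects instead.

```python
from collections import defaultdict


def sum_scores(*sources):
    """Add up scores per name across any number of line iterables.

    Hypothetical helper: each source is assumed to start with a header
    line followed by "<name> <score>" rows.
    """
    totals = defaultdict(int)
    for source in sources:
        lines = iter(source)
        next(lines, None)  # skip the header line, if there is one
        for line in lines:
            parts = line.split()
            if len(parts) != 2:
                continue  # ignore blank or malformed rows
            name, score = parts
            totals[name] += int(score)
    return dict(totals)


# In-memory stand-ins for data1.txt and data2.txt from the question
file1 = ["Name Score\n", "Michael 20\n", "Adrian 40\n", "Jane 60\n"]
file2 = ["Name Score\n", "Michael 30\n", "Adrian 50\n"]
totals = sum_scores(file1, file2)
for name, score in totals.items():
    print(name, score)
```

Because a missing key in a defaultdict(int) starts at 0, a name that appears in only one file (Jane here) simply keeps its single score.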

Related

How to extract many groups of cells separated by a specified number of rows in excel using python and write it to another file?

I have a CSV file with around 58 million cells of numerical data. I want to extract blocks of 16 cells (4 rows x 4 columns) whose starting rows are 49 rows apart.
Let me describe it more clearly.
[Image: the data I need to extract]
The image above shows the first set of data to be extracted (rows 23 to 26, columns 92 to 95). This data has to be written to another CSV file (preferably as a single row).
Then I move down 49 rows (to row 72) and extract another 4 rows x 4 columns, as shown in the image below.
[Image: next set of data]
Similarly, I need to keep going until I reach the end of the file.
[Image: third set]
The image above shows the next set.
I have to continue like this to the end of the file, extracting thousands of such blocks.
I wrote some code for this, but it isn't working and I can't find the mistake. Here it is:
import pandas as pd
import numpy
df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))
df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))
arrY = []
ex = 0
for i in range(len(df)):
    if i == 0:
        for j in range(4):
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    else:
        for j in range(4):
            if j+22+i*(49) >= len(df):
                ex = 1
                break
            # print(j)
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    if ex == 1:
        break
# print(arrY)
a = []
for i in range(len(arrY) - 3):
    p = arrY[i]+arrY[i+1]+arrY[i+2]+arrY[i+3]
    a.append(p)
print(numpy.shape(a))
numpy.savetxt('myfile.csv', a, delimiter=',')
Using the above code, I didn't get the result I wanted.
Please help with this and correct where I have gone wrong.
I couldn't attach my CSV file here, so please try any sample sheet you have, or create a simple one.
Thanks in advance! Have a great day.
I don't know exactly what your code is doing, but I wrote my own version:
import csv
from itertools import chain
CSV_PATH = 'TS_trace31.csv'
new_data = []
with open(CSV_PATH, 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # row_num stores the big jumps, e.g. 23, 72, 121 ...
    row_num = 23
    # n stores the position within the group, 0 - 3
    # with n we can find lines 23, 24, 25, 26
    n = 0
    # row_group collects the 4 rows of the current group
    row_group = []
    # loop over every row in the main file
    for row in reader:
        if reader.line_num == row_num + n:
            # the first time this is going to be 23 + 0
            # then we add one to n
            # so the next cycle will be 24 and so on
            n += 1
            print(reader.line_num)
            # add each row to its group
            row_group.append(row[91:95])
            # check if we are at the end of the group, e.g. 26
            if n == 4:
                # reset the group position
                n = 0
                # add the jump to the main row number
                row_num += 49
                # combine the whole row_group into a single row
                new_data.append(list(chain(*row_group)))
                # clear the row_group for the next set of rows
                row_group.clear()
                print('='*50)
        else:
            continue
# and finally write all the rows to a new file
# (newline='' avoids blank lines between rows on Windows)
with open('myfile.csv', 'w', newline='') as new_csvfile:
    writer = csv.writer(new_csvfile)
    writer.writerows(new_data)
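The same stride-and-slice logic can also be written as a small function, which makes it easier to try on a toy grid before running it on the real file. This is only a sketch: extract_blocks and its parameters are names invented here, rows and columns are counted from 0, and the whole table is assumed to fit in memory as a list of rows (the streaming csv.reader loop may be the better fit for a very large file).

```python
def extract_blocks(rows, first_row, row_step, col_start, col_end, height=4):
    """Flatten each height x (col_end - col_start) block into one row.

    Hypothetical helper: blocks start at first_row and repeat every
    row_step rows; an incomplete block at the end of the data is dropped.
    All indices are 0-based and col_end is exclusive.
    """
    blocks = []
    start = first_row
    while start + height <= len(rows):
        flat = []
        for row in rows[start:start + height]:
            flat.extend(row[col_start:col_end])
        blocks.append(flat)
        start += row_step
    return blocks


# Toy 12 x 6 grid where each cell holds 10 * row + column
grid = [[10 * r + c for c in range(6)] for r in range(12)]
# 2 x 2 blocks starting at row 1, repeating every 5 rows, columns 2..3
result = extract_blocks(grid, first_row=1, row_step=5, col_start=2, col_end=4, height=2)
print(result)
```

With rows = list(csv.reader(csvfile)), the settings used in the answer (line 23 first, a jump of 49, columns 91:95) would correspond roughly to extract_blocks(rows, 22, 49, 91, 95).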

Creating two columns from an unstructured file of IDs and sequences

Problem: Working with python 3.x, I have a file called input.txt with content as below
2345673 # First ID
0100121102020211111002 # first sequence (seq) which is long and goes to several lines
0120102100211001101200
6758442 #Second ID
0202111100011111022222 #second sequence (seq) which is long and goes to several lines
0202111110001120211210
0102101011211001101200
What I want: to process input.txt and save the results in output.csv so that, when I read it in pandas, the result is a data frame like the one below.
ID Seq
2345673 0 1 0 0 1 2 1 1 0 2 …
6758442 0 2 0 2 1 1 1 1 0 0 …
Below is my code:
with open("input.txt") as f:
    with open("out.csv", "w") as f1:
        for i, line in enumerate(f): #read each line in file
            if(len(line) < 15 ): #check if length line is say < 15
                id = line # if yes, make line ID
            else:
                seq = line # if not make it a sequence
                #print(id)
                lines = []
                lines.append(','.join([str(id),str(seq)]))
                for l in lines:
                    f1.write('('+l+'),\n') #write to file f1
When I read out.csv in pandas, the output is not what I want (see below). I would appreciate your help; I am really stuck.
(2345673
,0100121102020211111002
),
(2345673
,0120102100211001101200
),
(6758442
,0202111100011111022222
),
(6758442
,0202111110001120211210
),
(6758442
,0102101011211001101200),
import pandas as pd
### idea is to create two lists: one with ids and another with sequences
with open("input.txt") as f:
    ids=[]
    seqs=[]
    seq=""
    for i, line in enumerate(f):
        if (len(line) < 15 ) :
            seqs.append(seq)
            id=line
            id=id.rstrip('\n')
            id=id.rstrip(' ')
            ids.append(id)
            seq=""
        else:
            #next three lines combine all sequences that correspond to the same id into one
            additional_seq = line.rstrip('\n')
            additional_seq = additional_seq.rstrip(' ')
            seq+=additional_seq
    seqs.append(seq)
# the first entry is the empty string appended before the first id was seen
seqs=seqs[1:]
df = pd.DataFrame(list(zip(ids, seqs)), columns =['id', 'seq'])
df.to_csv("out.csv",index=False)
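If the Seq column should also show the digits separated by spaces, as in the desired output, one way to sketch it is below. It assumes, like the code above, that any line shorter than 15 characters is an ID; parse_records is a hypothetical name, and the sample lines are taken from the question with the explanatory comments left out.

```python
def parse_records(lines, id_max_len=15):
    """Group sequence lines under the ID line that precedes them.

    Hypothetical helper: a stripped line shorter than id_max_len is
    treated as an ID; every other non-empty line extends the current
    sequence.
    """
    records = []  # [id, sequence] pairs in file order
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if len(line) < id_max_len:
            records.append([line, ""])
        elif records:
            records[-1][1] += line
    # Put a space between the digits, as in the desired output
    return [(rec_id, " ".join(seq)) for rec_id, seq in records]


sample = [
    "2345673\n",
    "0100121102020211111002\n",
    "0120102100211001101200\n",
    "6758442\n",
    "0202111100011111022222\n",
]
rows = parse_records(sample)
for rec_id, seq in rows:
    print(rec_id, seq[:19], "...")
```

The resulting pairs can be passed to pd.DataFrame(rows, columns=['ID', 'Seq']) and written out with to_csv.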

Get multiple columns into text files

I have a CSV file and I want to split it into text files based on the first column, which holds the IDs. Each file should then contain the remaining columns of its row. For example:
file.csv
id val1 val 2 val3
1 50 52 60
2 45 84 96
and etc.
Here is my code:
dir_name = '/Users/user/My Documents/test/'
with io.open('file1.csv', 'rt',encoding='utf8') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)
    xx = []
    for row in reader:
        with open(os.path.join(dir_name, row[0] + ".txt"),'a') as f2:
            xx = row[1:2]
            f2.write(xx +"\n")
so it should be:
1.text
50 52 60
2.text
45 84 96
But it only creates files without any content.
Can anyone help me? Thanks in advance.
There were a couple of issues:
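It's actually a whitespace-separated values file, not a comma-separated one, so you have to change the delimiter from a comma to a space. The whitespace is also repeated, so you can pass skipinitialspace=True to csv.reader to ignore the extra spaces.
The slice row[1:2] selects only the first value column, and the resulting list can't be concatenated with a string. Take row[1:] instead and join it into a string before writing.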
This program meets your requirements:
#!/usr/bin/python
import io
import csv
import os
dir_name = './'
with io.open('input.csv', 'rt',encoding='utf8') as f:
    reader = csv.reader(f, skipinitialspace=True, delimiter=' ')
    next(reader)
    xx = []
    for row in reader:
        filename = os.path.join(dir_name, row[0])
        with open(filename + ".txt", 'a') as f2:
            xx = row[1:]
            f2.write(" ".join(xx) +"\n")
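Because every row is just whitespace-separated fields, str.split() is an alternative to configuring csv.reader here. The sketch below swaps that in and writes into a temporary directory so it can be run without touching real files; split_by_id and the sample lines are invented for the example.

```python
import os
import tempfile


def split_by_id(lines, out_dir):
    """Write the value columns of each row to <out_dir>/<id>.txt.

    Hypothetical helper: the first line is assumed to be a header and
    the first field of every other line is used as the file name.
    """
    written = []
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for line in rows:
        fields = line.split()  # collapses repeated spaces
        if len(fields) < 2:
            continue  # nothing to write for blank or ID-only rows
        path = os.path.join(out_dir, fields[0] + ".txt")
        with open(path, "a") as out:
            out.write(" ".join(fields[1:]) + "\n")
        written.append(path)
    return written


sample = ["id val1 val2 val3\n", "1  50  52  60\n", "2  45  84  96\n"]
with tempfile.TemporaryDirectory() as tmp:
    paths = split_by_id(sample, tmp)
    contents = {}
    for path in paths:
        with open(path) as f:
            contents[os.path.basename(path)] = f.read()
print(contents)
```

Opening in append mode, as in the answer, means that rows sharing an ID end up in the same file rather than overwriting each other.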

How to calculate from a dictionary in python

import operator
with open("D://program.txt") as f:
    Results = {}
    for line in f:
        part_one,part_two = line.split()
        Results[part_one] = part_two
c=sum(int(Results[x]) for x in Results)
r=c/12
d=len(Results)
F=max(Results.items(), key=operator.itemgetter(1))[0]
u=min(Results.items(), key=operator.itemgetter(1))[0]
print ("Number of entries are",d)
print ("Student with HIGHEST mark is",F)
print ("Student with LOWEST mark is",u)
print ("Avarage mark is",r)
Results = [ (v,k) for k,v in Results.items() ]
Results.sort(reverse=True)
for v,k in Results:
    print(k,v)
import sys
orig_stdout = sys.stdout
f = open('D://programssr.txt', 'w')
sys.stdout = f
print ('Number of entries are',d)
print ("Student with HIGHEST mark is",F)
print ("Student with LOWEST mark is",u)
print ("Avarage mark is",r)
for v,k in Results:
    print(k,v)
sys.stdout = orig_stdout
f.close()
I want to read a text file, compute some results, and write them to a new file. The problem is that the computation fails because of the NAMES and MARKS header line in the file; if I remove it, everything works fine. I want to do the calculations without removing NAMES and MARKS from the text file. What am I doing wrong?
NAMES MARKS
Lux 95
Veron 70
Lesley 88
Sticks 80
Tipsey 40
Joe 62
Goms 18
Wesley 35
Villa 11
Dentist 72
Onty 50
Just consume the first line using the next() function before looping over the file:
with open("D://program.txt") as f:
    Results = {}
    next(f)
    for line in f:
        part_one,part_two = line.split()
        Results[part_one] = part_two
Note that file objects are iterators (one-shot iterables): when you loop over them you consume the lines, and you no longer have access to them afterwards.
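Once the header is skipped, it may also help to store the marks as integers, since the code in the question compares them as strings (as strings, '95' sorts above '100') and divides by a hard-coded 12. A sketch under those assumptions, using a few of the rows from the question as in-memory data; summarise is a made-up helper name and at least one data row is assumed:

```python
def summarise(lines):
    """Return count, highest, lowest and average mark from "<name> <mark>" lines.

    Hypothetical helper: the first line is assumed to be the header.
    """
    rows = iter(lines)
    next(rows, None)  # consume the NAMES MARKS header
    marks = {}
    for line in rows:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, mark = parts
        marks[name] = int(mark)  # compare numbers, not strings
    return {
        "entries": len(marks),
        "highest": max(marks, key=marks.get),
        "lowest": min(marks, key=marks.get),
        "average": sum(marks.values()) / len(marks),
    }


sample = ["NAMES MARKS\n", "Lux 95\n", "Veron 70\n", "Goms 18\n", "Villa 11\n"]
stats = summarise(sample)
print(stats)
```

Dividing by len(marks) instead of a fixed number keeps the average correct when rows are added to or removed from the file.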

Specific Fields Python3

I am trying to select specific fields from my Qdata.txt file and use fields[2] to calculate the average for each year separately. My code gives only the overall average.
The data file looks like this (the first day of each year is 0101 and the last is 1231):
Date 3700300 6701500
20000101 21.00 223.00
20000102 20.00 218.00
. .
20001231 7.40 104.00
20010101 6.70 104.00
. .
20130101 8.37 111.63
. .
20131231 45.00 120.98
import sys
td=open("Qdata.txt","r") # open file Qdata
total=0
count=0
row1=True
for row in td :
    if (row1) :
        row1=False # row1 is for topic
    else:
        fields=row.split()
        try:
            total=total+float(fields[2])
            count=count+1
        # Errors.
        except IndexError:
            continue
        except ValueError:
            print("File is incorrect.")
            sys.exit()
print("Average in 2000 was: ",total/count)
You could use itertools.groupby, with the first four characters of each line as the grouping key.
import itertools
with open("data.txt") as f:
    next(f) # skip first line
    groups = itertools.groupby(f, key=lambda s: s[:4])
    for k, g in groups:
        print(k, [s.split() for s in g])
This gives you the entries grouped by year, for further processing.
Output for your example data:
2000 [['20000101', '21.00', '223.00'], ['20000102', '20.00', '218.00'], ['20001231', '7.40', '104.00']]
2001 [['20010101', '6.70', '104.00']]
2013 [['20130101', '8.37', '111.63'], ['20131231', '45.00', '120.98']]
You could create a dict (or even a defaultdict) for total and count instead:
import sys
from collections import defaultdict
td=open("Qdata.txt","r") # open file Qdata
total=defaultdict(float)
count=defaultdict(int)
row1=True
for row in td :
    if (row1) :
        row1=False # row1 is for topic
    else:
        fields=row.split()
        try:
            year = int(fields[0][:4])
            total[year] += float(fields[2])
            count[year] += 1
        # Errors.
        except IndexError:
            continue
        except ValueError:
            print("File is incorrect.")
            sys.exit()
print("Average in 2000 was: ",total[2000]/count[2000])
Every year separately? You have to divide your input into groups; something like this might be what you want:
from collections import defaultdict
row1 = True
year_sums = defaultdict(list)
with open("Qdata.txt", "r") as td:
    for row in td:
        if row1:
            row1 = False
            continue
        fields = row.split()
        year = fields[0][:4]
        year_sums[year].append(float(fields[2]))
for year in year_sums:
    # len() gives the number of values collected for the year
    average = sum(year_sums[year])/len(year_sums[year])
    print("Average in {} was: {}".format(year, average))
This is just example code that I haven't tested, but it should give you an idea of what you can do. year_sums is a defaultdict containing lists of values grouped by year. You can then use it for other statistics if you want.
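Putting the grouping idea together with the averaging step, a compact sketch could look like this. It assumes the rows are sorted by date (itertools.groupby only merges adjacent rows with the same key) and that the third column is the one to average; yearly_averages and the sample rows are illustrative only.

```python
import itertools


def yearly_averages(lines, column=2):
    """Average one column per year for date-sorted "<YYYYMMDD> <v1> <v2>" lines.

    Hypothetical helper: the first line is treated as a header and rows
    with too few fields are ignored.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header line
    parsed = (line.split() for line in rows)
    valid = [fields for fields in parsed if len(fields) > column]
    averages = {}
    for year, group in itertools.groupby(valid, key=lambda f: f[0][:4]):
        values = [float(fields[column]) for fields in group]
        averages[year] = sum(values) / len(values)
    return averages


sample = [
    "Date 3700300 6701500\n",
    "20000101 21.00 223.00\n",
    "20000102 20.00 218.00\n",
    "20010101 6.70 104.00\n",
]
result = yearly_averages(sample)
for year, avg in result.items():
    print("Average in {} was: {}".format(year, avg))
```

If the file is not guaranteed to be sorted by date, the defaultdict approach is the safer choice, because it does not depend on row order.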
