Question on calculating incoming data from file - python-3.x

I am reading a data file and need to compute running totals for the different items by adding up their values across lines. For example:
Fruit,Number
banana,25
apple,12
kiwi,29
apple,44
apple,81
kiwi,3
banana,109
kiwi,113
kiwi,68
We need to add a third column holding the running total for that fruit, and a fourth holding the running total of all fruits.
So the output should look like this:
Fruit,Number,TotalFruit,TotalAllFruits
banana,25,25,25
apple,12,12,37
kiwi,29,29,66
apple,44,56,110
apple,81,137,191
kiwi,3,32,194
banana,109,134,303
kiwi,113,145,416
kiwi,68,213,484
I was able to print the first two columns, but I am having trouble with the last two:
import sys
import re
f1 = open("SampleInput.csv", "r")
f2 = open('SampleOutput.csv', 'a')
sys.stdout = f2
print("Fruit,Number,TotalFruit,TotalAllFruits")
for line1 in f1:
    fruit_list = line1.split(',')
    exec("%s = %d" % (fruit_list[1], 0))
    print(fruit_list[0] + ',' + fruit_list[1])
I am just learning python, so I want to apologize in advance if I am missing something very simple.

You need a 2-D array (a list of rows) to hold the values read from the input file.
Inside the loop, look up the totals from the previous lines and use them to calculate the values for the current line.
Print the 2-D array once all input lines have been read.
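As a rough illustration of the "look up the previous totals" step, here is a minimal sketch that keeps the running totals in a dictionary instead of re-reading earlier rows. The function name add_running_totals and the in-memory sample are assumptions made for the example; only the column names come from the question.

```python
import csv
import io

def add_running_totals(lines):
    """Return the rows with per-fruit and overall running totals appended."""
    reader = csv.reader(lines)
    header = next(reader)
    out = [header + ["TotalFruit", "TotalAllFruits"]]
    per_fruit = {}    # fruit name -> running total for that fruit
    grand_total = 0   # running total over every row seen so far
    for row in reader:
        if not row:
            continue  # skip blank lines
        fruit, number = row[0], int(row[1])
        per_fruit[fruit] = per_fruit.get(fruit, 0) + number
        grand_total += number
        out.append([fruit, row[1], str(per_fruit[fruit]), str(grand_total)])
    return out

# Hypothetical in-memory sample; open("SampleInput.csv", newline="") would work the same way.
sample = io.StringIO("Fruit,Number\nbanana,25\napple,12\nkiwi,29\napple,44\n")
for row in add_running_totals(sample):
    print(",".join(row))
```

Writing the result with csv.writer rather than redirecting sys.stdout would also avoid the exec call in the question.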

I would recommend the pandas library, as it makes this easier. A grouped cumulative sum gives the per-fruit total, and a plain cumulative sum gives the overall total:
import pandas as pd
df = pd.read_csv("SampleInput.csv")
# running total within each fruit, in file order
df["TotalFruit"] = df.groupby("Fruit")["Number"].cumsum()
# running total over all rows
df["TotalAllFruits"] = df["Number"].cumsum()
df.to_csv("SampleOutput.csv", index=False)
The output file has the columns:
Fruit | Number | TotalFruit | TotalAllFruits
Feel free to change the columns to suit your needs and add your own logic.

Related

Fill csv data lists with for loop

I am manipulating .csv files. I have to loop through each column of numeric data in the file and enter them into different lists. The code I have is the following:
import csv
salto_linea = "\n"
csv_file = "02_CSV_data1.csv"
with open(csv_file, 'r') as csv_doc:
    doc_reader = csv.reader(csv_doc, delimiter = ",")
    mpg = []
    cylinders = []
    displacement = []
    horsepower = []
    weight = []
    acceleration = []
    year = []
    origin = []
    lt = [mpg, cylinders, displacement, horsepower,
          weight, acceleration, year, origin]
    for i,ln in zip(range (0,9),lt):
        print(f"{i} -> {ln}")
        for row in doc_reader:
            y = row[i]
            ln.append(y)
In the outer loop I want range() to act as a column index, so that the nested for loop walks down the first column (the first element of each row in the csv) and appends it to the first list in 'lt'. The problem is that the first column is read and stored, but although range() keeps advancing in the outer loop, the nested loop is finished: I expected the new value i = 1 to re-enter the nested loop and traverse the next column, and so on, but it does not. I also tried a while loop with a counter incremented on each iteration to serve as the index, but that did not work either.
How can I fill the sublists in 'lt' with the data from the csv file?
Without seeing the contents of the CSV file itself, the simplest way to read the data into a table is with the pandas module, which can do it in one line of code.
import pandas as pd
df = pd.read_csv('02_CSV_data1.csv')
This reads all the data into a dataframe that you can then work with.
Alternatively, amend the for loop like this, so that the file is read once and each row is spread across the lists:
for row in doc_reader:
    for i, ln in enumerate(lt):
        ln.append(row[i])
For bigger data, I would prefer pandas, which has vectorised methods.
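The original nested loop stops after the first column because a csv.reader is an iterator: once the inner loop has consumed every row, there is nothing left for the next value of i. A minimal sketch of one way around that, reading the rows once and transposing them with zip, is shown below. The helper name read_columns, the has_header flag and the three-column sample are assumptions for illustration, since the real file contents are not shown.

```python
import csv
import io

def read_columns(lines, has_header=True):
    """Read all rows once, then transpose them into one list per column."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # drop the header row (assumed to exist)
    rows = [row for row in reader if row]  # ignore blank lines
    # zip(*rows) pairs up the i-th field of every row, i.e. it yields columns
    return [list(column) for column in zip(*rows)]

# Hypothetical sample standing in for 02_CSV_data1.csv
sample = io.StringIO("mpg,cylinders,displacement\n18.0,8,307.0\n15.0,8,350.0\n")
mpg, cylinders, displacement = read_columns(sample)
print(mpg)
print(cylinders)
```

The values are still strings at this point; converting them with float() or int() is left to the caller.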

What are faster ways of reading big data set and apply row wise operations other than pandas and dask?

I am working on code that needs to populate a set of data structures from each row of a big table. Right now I use pandas to read the data and do some elementary validation and preprocessing. However, the rest of the process, putting the data into the corresponding data structures, takes a considerably long time. For example, in the following code I have a table with 15 M records. The table has three columns, and I create a foo() object from each row and add it to a list.
import pandas as pd
# Profile.csv
# Index    | Name | Family| DB
# ---------|------|-------|----------
# 0        | Jane | Doe   | 08/23/1977
# ...
# 15000000 | Jhon | Doe   | 01/01/2000
class foo():
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd
def populate(row, my_list):
    my_list.append(foo(*row))
# reading the csv file and parsing the date column (dates look like 08/23/1977)
df = pd.read_csv('Profile.csv')
df['DB'] = pd.to_datetime(df['DB'], format='%m/%d/%Y')
# using apply to create a foo() object and add it to the list
my_list = []
df.apply(populate, axis=1, args=(my_list,))
So after using pandas to convert the date strings to date objects, I just need to iterate over the DataFrame to create my objects and add them to the list. This process is very time-consuming (in my real example it takes even longer, since my data structure is more complex and I have more columns). So I am wondering what the best practice is in this case to improve my run time. Should I even use pandas to read my big tables and process them row by row?
It would simply be faster to use a file handle:
input_file = "profile.csv"
sep=";"
my_list = []
with open(input_file) as fh:
    # map each column name in the header to its position
    cols = {}
    for i, col in enumerate(fh.readline().strip().split(sep)):
        cols[col] = i
    for line in fh:
        line = line.strip().split(sep)
        # rewrite MM/DD/YYYY as YYYY-MM-DD
        date = line[cols["DB"]].split("/")
        date = [date[2], date[0], date[1]]
        line[cols["DB"]] = "-".join(date)
        populate(line, my_list)
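A variant of the same idea using the standard csv module is sketched below; it has not been benchmarked against the version above. It assumes a comma-separated file whose header contains the names Name, Family and DB, and it parses the date into a date object rather than rewriting the string. The function name load_profiles and the sample data are hypothetical; the class mirrors the one in the question.

```python
import csv
import io
from datetime import datetime

class foo:
    """Same shape as the class in the question."""
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd

def load_profiles(lines, date_format="%m/%d/%Y"):
    # Assumed header: Name,Family,DB (other columns, if any, are ignored).
    reader = csv.DictReader(lines)
    return [
        foo(r["Name"], r["Family"], datetime.strptime(r["DB"], date_format).date())
        for r in reader
    ]

# Hypothetical sample; open("Profile.csv", newline="") would be passed the same way.
sample = io.StringIO("Name,Family,DB\nJane,Doe,08/23/1977\nJhon,Doe,01/01/2000\n")
profiles = load_profiles(sample)
print(profiles[0].name, profiles[0].bd)
```

csv.DictReader handles quoted fields that a plain split would break on, at the cost of building a dictionary per row.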
There are multiple approaches for this kind of situation; however, the fastest and most effective is vectorization where possible. The solution for the example in this post, iterating over the columns directly instead of calling apply on every row, could be as follows:
my_list = [foo(*args) for args in zip(df["Name"], df["Family"], df["DB"])]
If vectorization is not possible, converting the data frame to a dictionary can significantly improve performance. For the current example it would be something like:
my_list = []
dc = df.to_dict()  # {column: {row_index: value}}
for i in dc["Name"]:
    my_list.append(foo(dc["Name"][i], dc["Family"][i], dc["DB"][i]))
The last solution is particularly effective when the structures and processing are more complex.

Is there any ways to make this more efficient?

I have 24 more attempts to submit this task. I spent hours and my brain does not work anymore. I am a beginner with Python can you please help to figure out what is wrong? I would love to see the correct code if possible.
Here is the task itself and the code I wrote below.
Note that you can have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
You are given a content of CSV-file with information about set of trades. It contains the following columns:
TIME - Timestamp of a trade in format Hour:Minute:Second.Millisecond
PRICE - Price of one share
SIZE - Count of shares executed in this trade
EXCHANGE - The exchange that executed this trade
For each exchange find the one minute-window during which the largest number of trades took place on this exchange.
Note that:
You need to send source code of your program.
You have only 25 attempts to submit a solution for this task.
You have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
Input format
Input contains several lines. You can read it from standard input or from the file “trades.csv”.
Each line contains information about one trade: TIME, PRICE, SIZE and EXCHANGE. Numbers are separated by comma.
Lines are listed in ascending order of timestamps. Several lines can contain the same timestamp.
Size of input file does not exceed 5 MB.
See the example below to understand the exact input format.
Output format
If input contains information about k exchanges, print k lines to standard output.
Each line should contain the only number — maximum number of trades during one minute-window.
You should print answers for exchanges in lexicographical order of their names.
Sample
Input:
09:30:01.034,36.99,100,V
09:30:55.000,37.08,205,V
09:30:55.554,36.90,54,V
09:30:55.556,36.91,99,D
09:31:01.033,36.94,100,D
09:31:01.034,36.95,900,V
Output:
2
3
Notes
In the example four trades were executed on exchange “V” and two trades were executed on exchange “D”. Not all of the “V”-trades fit in one minute-window, so the answer for “V” is three.
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60.+float(item[0][3:5])+float(item[0][6:8])/60.+float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1]-dex[item][0] <1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1]-dex[item][t] <1:
                    ccount += 1
                else: break
            if ccount>count:
                count=ccount
            ccount=1
    print(count)
First of all, it is not necessary to use the datetime and csv modules for such a simple case (as in Ed-Ward's example).
If we remove the colons and dots from the time strings, they can be converted with int() directly, which is easier than what you tried in your example. Note, however, that adding one minute to such a number is only valid within the same hour (09:59 plus one minute is 10:00, not 09:60), so the example below converts each timestamp to milliseconds instead.
CSV features like dialects and special formatting are not used, so I suggest a simple split(",").
Now about efficiency. Efficiency here means time complexity.
The more times you pass over your array of timestamps from beginning to end, the higher the complexity of the algorithm.
So our goal is to minimize the number of passes: ideally make a single pass over all rows, and in particular avoid nested loops that scan the collections from beginning to end.
For such a task it is better to use a deque instead of a tuple or list, because you can remove the first element with popleft() and append to the end in O(1).
For each row, append the timestamp to the end of that exchange's queue. If the difference between the current and the first element has reached one minute, remove the first element with popleft(); since at most one element is removed per element added, the queue never shrinks. Once the whole file is processed, the length of each queue is the maximum number of trades in a one-minute window.
Example with linear time complexity O(n):
from collections import deque

def to_ms(ts):
    # "HH:MM:SS.mmm" -> milliseconds since midnight
    h, m, s = ts.split(":")
    sec, ms = s.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(sec)) * 1000 + int(ms)

ex_list = {}
# utf-8-sig drops a leading byte-order mark if the file has one
with open("trades.csv", encoding="utf-8-sig") as f:
    for line in f:
        fields = line.strip().split(",")
        if len(fields) < 4:
            continue  # skip blank lines
        curr_tm = to_ms(fields[0])
        window = ex_list.setdefault(fields[3], deque())
        window.append(curr_tm)
        # window spans a full minute or more: drop the oldest trade
        if curr_tm >= window[0] + 60000:
            window.popleft()
print("\n".join(str(len(ex_list[k])) for k in sorted(ex_list)))
This code should work:
import csv
import datetime
diff = datetime.timedelta(minutes=1)
def date_calc(start, dates):
    for i, date in enumerate(dates):
        if date >= start + diff:
            return i
    return i + 1
exchanges = {}
with open("trades.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        this_exchange = row[3]
        if this_exchange not in exchanges:
            exchanges[this_exchange] = []
        time = datetime.datetime.strptime(row[0], "%H:%M:%S.%f")
        exchanges[this_exchange].append(time)
ex_max = {}
for name, dates in exchanges.items():
    ex_max[name] = 0
    for i, d in enumerate(dates):
        x = date_calc(d, dates[i:])
        if x > ex_max[name]:
            ex_max[name] = x
print('\n'.join([str(ex_max[k]) for k in sorted(ex_max.keys())]))
Output:
2
3
( obviously please check it for yourself before uploading it :) )
I think the issue with your current code is that you don't put the output in lexicographical order of their names...
If you want to use your current code, then here is a (hopefully) fixed version:
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
counts = []
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60.+float(item[0][3:5])+float(item[0][6:8])/60.+float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1]-dex[item][0] <1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1]-dex[item][t] <1:
                    ccount += 1
                else: break
            if ccount>count:
                count=ccount
            ccount=1
    counts.append((item, count))
counts.sort(key=lambda x: x[0])
print('\n'.join([str(x[1]) for x in counts]))
Output:
2
3
I do think you can make your life easier in the future by using Python's standard library, though :)
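For comparison, here is a two-pointer sketch of the same task that uses no imports at all and tracks the best window explicitly, checked against the sample from the question. The names to_ms and busiest_minute are made up for this example, and it assumes every timestamp has the exact form HH:MM:SS.mmm and that a window is half-open, so a trade exactly 60 seconds after the first one falls outside it (which is what the sample answer of 3 for "V" implies).

```python
def to_ms(ts):
    """Convert 'HH:MM:SS.mmm' to milliseconds since midnight."""
    h, m, s = ts.split(":")
    sec, ms = s.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(sec)) * 1000 + int(ms)

def busiest_minute(lines):
    """Map each exchange to its maximum number of trades in any one-minute window."""
    times = {}
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        times.setdefault(fields[3], []).append(to_ms(fields[0]))
    result = {}
    for name, ts in times.items():
        best = left = 0
        for right, t in enumerate(ts):
            # advance the left edge until the window is shorter than one minute
            while t - ts[left] >= 60_000:
                left += 1
            best = max(best, right - left + 1)
        result[name] = best
    return result

sample = [
    "09:30:01.034,36.99,100,V",
    "09:30:55.000,37.08,205,V",
    "09:30:55.554,36.90,54,V",
    "09:30:55.556,36.91,99,D",
    "09:31:01.033,36.94,100,D",
    "09:31:01.034,36.95,900,V",
]
counts = busiest_minute(sample)
print("\n".join(str(counts[k]) for k in sorted(counts)))
```

Each index only ever moves forward, so the work per exchange stays linear in the number of its trades.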

How to split a row by specific string length in a dataframe in Python?

I have a file like this:
system
1000
1VEA C 1 9.294 11.244 11.083
1VEA C1 2 9.324 11.375 11.161
1VEA H 3 9.243 11.396 11.232
...
1203VEA H2092601 20.738 16.293 7.837
1203VEA H2192602 20.900 16.225 7.869
1203VEA H2292603 20.822 16.330 7.989
I want to generate a dataframe with 6 columns. I used the following command to generate it:
df = pd.read_csv('system.gro', skiprows=[0,1], delim_whitespace=True, header=None)
However, in the rows starting with 1203 there is no whitespace between H20 and 92601, so the command above cannot split them. I used to split each line by fixed string lengths, like:
f1 = open(fileName, 'r')
for line in f1.readlines():
    atomName = line[8:15].strip(' ')
    globalIdx = int(line[15:20].strip(' '))
But it takes a really long time to process the file. Does anyone have an idea how to handle this with a dataframe?
As suggested by SRT HellKitty in the comments, use pd.read_fwf (see docs) like this:
from io import StringIO
import pandas as pd
data="""
1VEA C 1 9.294 11.244 11.083
1VEA C1 2 9.324 11.375 11.161
1VEA H 3 9.243 11.396 11.232
1203VEA H2092601 20.738 16.293 7.837
1203VEA H2192602 20.900 16.225 7.869
1203VEA H2292603 20.822 16.330 7.989
"""
### make sure that the widths are correct!
df=pd.read_fwf(StringIO(data),colspecs=[(0,8),(8,14),(14,20),(20,28),(28,36),(36,44)],header=None)
print(df)
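To make the meaning of colspecs concrete: each pair is a half-open (start, end) character range, exactly like a Python slice. The sketch below applies such ranges by hand to a single line. The layout used here (four 5-character label fields followed by three 8-character coordinates) is an assumption about the file and must be checked against the real data; split_fixed and GRO_COLSPECS are names invented for the example.

```python
def split_fixed(line, colspecs):
    """Slice one line at half-open (start, end) character offsets."""
    return [line[start:end].strip() for start, end in colspecs]

# Assumed layout: residue number, residue name, atom name, atom index (5 chars each),
# then x, y, z (8 chars each). Adjust to the actual file.
GRO_COLSPECS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 28), (28, 36), (36, 44)]

# Build a line with exactly those widths so the example does not depend on pasted spacing.
line = "%5d%-5s%5s%5d%8.3f%8.3f%8.3f" % (1203, "VEA", "H20", 92601, 20.738, 16.293, 7.837)
fields = split_fixed(line, GRO_COLSPECS)
print(fields)
```

The same list of pairs can be passed as colspecs to pd.read_fwf. This layout yields seven fields; merging the first two ranges into (0, 10) gives the six columns asked for in the question.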

Data Processing in Python

I'm dealing with a text data file that has 8 columns, each listing temperature, time, damping coefficients, etc. I need to take lines of data only in the temperature range of 0.320 to 0.322.
Here is a sample line of my data (there are thousands of lines):
time temp acq. freq. amplitude damping etc....
6.28444 0.32060 413.00000 117.39371 48.65073 286.00159
The only columns I care about are time, temp, and damping. I need those three values to append to my lists, but only when the temperature is in the specified range (there are some lines of my data where the temperature is all the way up at 4 kelvins, and this data is garbage).
I am using Python 3. Here are the things I have tried thus far
f = open('alldata','r')
c = f.readlines()
temperature = []
newtemp = []
damping = []
time = []
for line in c [0:]:
    line = line.split()
    temperature.append(line[1])
    damping.append(line[4])
    time.append(line[0])
for i in temperature:
    if float(i)>0.320 and float(i)<0.325:
        newtemp.append(float(i))
When I printed the list newtemp, I could see that this code correctly filled it with temperature values only in that range. However, I also need my damping and time lists to contain only the values that correspond to that small temperature range, and I'm not sure how to achieve that with this code.
I have also tried this, recommended by someone here:
output = []
lines = open('alldata', 'r')
for line in lines:
    temp = line.split()
    if float(temp[1]) > 0.320 and float(temp[1]) < 0.322:
        output.append(line)
print(output)
And I get an error that says:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
I will note that I am very new to coding, so I apologize if this turns out to be a silly question.
Data:
temperature, time, coeff...
0.32, 12:00:23, 2,..
0.43, 11:22:23, 3,..
Here, temperature is in the first column.
output = []
lines = open('data.file', 'r')
for line in lines:
    temp = line.split(',')
    if float(temp[0]) > 0.320 and float(temp[0]) < 0.322:
        output.append(line)
print(output)
You can use the pandas module:
import pandas as pd
# if the file with the data is an excel file use:
df = pd.read_excel('data.xlsx')
# if the file is csv
df = pd.read_csv('data.csv')
# if the column of interest is named 'temperature', keep the matching rows (all columns)
selected = df[(df['temperature'] > 0.320) & (df['temperature'] < 0.322)]
If you do not have pandas installed see here
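For the plain-Python route, a minimal sketch that filters while reading, so that the time, temperature and damping lists stay aligned, could look like this. It assumes whitespace-separated columns with time at index 0, temperature at index 1 and damping at index 4, as in the code in the question; the function name filter_by_temperature is made up for the example.

```python
def filter_by_temperature(lines, low=0.320, high=0.322):
    """Collect time, temperature and damping for rows with low < temperature < high."""
    time, temperature, damping = [], [], []
    for line in lines:
        fields = line.split()
        try:
            # assumed column positions: 0 = time, 1 = temperature, 4 = damping
            row = (float(fields[0]), float(fields[1]), float(fields[4]))
        except (IndexError, ValueError):
            continue  # header, blank or malformed line
        if low < row[1] < high:
            time.append(row[0])
            temperature.append(row[1])
            damping.append(row[2])
    return time, temperature, damping

sample = [
    "time temp acq. freq. amplitude damping",
    "6.28444 0.32060 413.00000 117.39371 48.65073 286.00159",
    "7.10000 4.00000 413.00000 117.39371 48.65073 286.00159",
]
time, temperature, damping = filter_by_temperature(sample)
print(len(time), time, temperature, damping)
```

With the real file, pass the open file object instead of the sample list. Printing len(time) or a short slice rather than the whole list also avoids the notebook's IOPub data rate message, which is triggered by very large output.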
