Difficulty permanently replacing values in nested dictionary imported from csv - python-3.x

I am (extremely) new to coding and trying to automate some processes for manipulating data as part of my PhD.
I have a CSV file from a heart rate monitor with time stored as MM:SS.s and heart rate at that time. e.g.
Time, Heart_rate
00:00.6, 100
00:01.0, 102
00:01.5, 102
I've used the csv module to import the file and DictReader to get the data into an array.
import csv
with open('hr_data.csv', 'rU') as infile:
    reader = csv.DictReader(infile, delimiter=',')
The data comes in as strings, so I have used the following code to first convert heart rate to a number, and then convert time (e.g. 00:05.5 for 5.5 seconds) to a float number of seconds.
sec = 0
for row in reader:
    row['Heart_rate'] = int(row['Heart_rate'])
    temp = row.get('Time')
    sec = (float(temp[3:7]) + (float(temp[0:2]) * 60))
    row['Time'] = sec
This seems to work if I print(row) afterwards (everything is numeric and time is in units of seconds). However, when I then move on to try to bin the data into 10-second bins, everything has reverted back to the original strings, and I can't even do:
for row in reader:
    print(row)
as this just prints nothing...
Thanks in advance.

Do
with open('hr_data.csv', 'rU') as infile:
    reader = list(csv.DictReader(infile, delimiter=','))
and it will work like you want.
csv.DictReader behaves like a generator. It goes through each line one by one, and when it's done, it is done. Because you went through all the lines in your first for loop, it had already read every line from the file and was exhausted.
To save a generator's results (store all the rows), cast the generator to a list.
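As a rough end-to-end sketch (assuming the column names from the question; the 10-second binning step is purely illustrative, since the original post doesn't show that code):
import csv
from collections import defaultdict

with open('hr_data.csv') as infile:
    rows = list(csv.DictReader(infile, skipinitialspace=True))  # materialise the iterator once

# convert the strings in place
for row in rows:
    row['Heart_rate'] = int(row['Heart_rate'])
    mm, ss = row['Time'].split(':')
    row['Time'] = int(mm) * 60 + float(ss)

# rows can now be iterated as often as needed, e.g. to bin into 10-second bins
bins = defaultdict(list)
for row in rows:
    bins[int(row['Time'] // 10)].append(row['Heart_rate'])
for b in sorted(bins):
    print(b * 10, sum(bins[b]) / len(bins[b]))  # mean heart rate per 10-second bin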

Yes, your calculations seem correct. However, once you leave the with construct, many of the things it set up are no longer usable. For instance, reader is built from the file you opened there; once that goes away, so do the row values that came from it. You need to arrange to save what you want as you go through the loop.
>>> import csv
>>> times = []
>>> heart_rates = []
>>> with open('heart.csv') as heart:
...     reader = csv.DictReader(heart, skipinitialspace=True)
...     for row in reader:
...         temp = row['Time']
...         times.append(float(temp[3:7]) + (float(temp[0:2]) * 60))
...         heart_rates.append(int(row['Heart_rate']))
...
>>> times
[0.6, 1.0, 1.5]
>>> heart_rates
[100, 102, 102]
Correction: upon discussing this with ddg I've learned that reader does persist outside the with. Unfortunately, though, I haven't been able to re-read the rows in reader using for row in reader: row, because outside the with the file heart has been closed.

Related

What are faster ways of reading a big data set and applying row-wise operations, other than pandas and dask?

I am working on code where I need to populate a set of data structures based on each row of a big table. Right now, I am using pandas to read the data and do some elementary validation preprocessing. However, when I get to the rest of the process and start putting the data into the corresponding data structures, it takes a considerably long time for the loop to complete and for my data structures to get populated. For example, in the following code I have a table with 15M records. The table has three columns, and I create a foo() object based on each row and add it to a list.
# Profile.csv
# Index | Name | Family| DB
# ---------|------|-------|----------
# 0. | Jane | Doe | 08/23/1977
# ...
# 15000000 | Jhon | Doe | 01/01/2000
class foo():
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd

def populate(row, my_list):
    my_list.append(foo(*row))
import pandas as pd

# reading the csv file and formatting the date column
df = pd.read_csv('Profile.csv')
df['DB'] = pd.to_datetime(df['DB'], format='%m/%d/%Y')  # format matching the 08/23/1977-style sample
# using apply to create a foo() object for each row and add it to the list
my_list = []
df.apply(populate, axis=1, args=(my_list,))
So, after using pandas to convert the string dates to date objects, I just need to iterate over the DataFrame to create my objects and add them to the list. This process is very time-consuming (in my real example it takes even longer, since my data structure is more complex and I have more columns). So I am wondering what the best practice is in this case to improve my run time. Should I even use pandas to read my big tables and process them row by row?
It would simply be faster to use a plain file handle:
input_file = "profile.csv"
sep = ","  # the sample file is comma-separated
my_list = []

with open(input_file) as fh:
    # map each column name from the header line to its position
    cols = {}
    for i, col in enumerate(fh.readline().strip().split(sep)):
        cols[col] = i
    for line in fh:
        line = line.strip().split(sep)
        # reorder MM/DD/YYYY into YYYY-MM-DD
        date = line[cols["DB"]].split("/")
        date = [date[2], date[0], date[1]]
        line[cols["DB"]] = "-".join(date)
        # pass only the fields the foo() constructor expects
        populate([line[cols["Name"]], line[cols["Family"]], line[cols["DB"]]], my_list)
There are multiple approaches for this kind of situation; however, the fastest and most effective method is vectorization, where possible. A vectorized solution for the example I demonstrated in this post could be as follows:
my_list = [foo(*args) for args in zip(df["Name"], df["Family"], df["DB"])]
If vectorization is not possible, converting the data frame to a dictionary can significantly improve performance. For the current example it would be something like:
my_list = []
dc = df.to_dict()
# df.to_dict() gives {column: {row_index: value}}, so iterate over the row indices
for i in dc["Name"]:
    my_list.append(foo(dc["Name"][i], dc["Family"][i], dc["DB"][i]))
The last solution is particularly effective when the structures and processing involved are more complex.
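For completeness, a minimal end-to-end sketch of the vectorized route, assuming the column names and the 08/23/1977-style date format shown in the question (the file name and format string are guesses from the sample, not verified against the real data):
import pandas as pd

class Foo:
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd

# read the table once, parse the date column, then build all objects in a single pass
df = pd.read_csv('Profile.csv')
df['DB'] = pd.to_datetime(df['DB'], format='%m/%d/%Y')
my_list = [Foo(name, last, bd) for name, last, bd in zip(df['Name'], df['Family'], df['DB'])]
print(len(my_list))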

Is there any way to make this more efficient?

I have 24 more attempts to submit this task. I have spent hours on it and my brain does not work anymore. I am a beginner with Python; can you please help me figure out what is wrong? I would love to see the correct code if possible.
Here is the task itself and the code I wrote below.
Note that you can have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
You are given a content of CSV-file with information about set of trades. It contains the following columns:
TIME - Timestamp of a trade in format Hour:Minute:Second.Millisecond
PRICE - Price of one share
SIZE - Count of shares executed in this trade
EXCHANGE - The exchange that executed this trade
For each exchange, find the one-minute window during which the largest number of trades took place on that exchange.
Note that:
You need to send the source code of your program.
You have only 25 attempts to submit a solution for this task.
You have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
Input format
Input contains several lines. You can read it from standard input or the file "trades.csv".
Each line contains information about one trade: TIME, PRICE, SIZE and EXCHANGE. Values are separated by commas.
Lines are listed in ascending order of timestamps. Several lines can contain the same timestamp.
The size of the input file does not exceed 5 MB.
See the example below to understand the exact input format.
Output format
If the input contains information about k exchanges, print k lines to standard output.
Each line should contain a single number: the maximum number of trades during a one-minute window on that exchange.
You should print answers for exchanges in lexicographical order of their names.
Sample
Input:
09:30:01.034,36.99,100,V
09:30:55.000,37.08,205,V
09:30:55.554,36.90,54,V
09:30:55.556,36.91,99,D
09:31:01.033,36.94,100,D
09:31:01.034,36.95,900,V
Output:
2
3
Notes
In the example, four trades were executed on exchange "V" and two trades were executed on exchange "D". Not all of the "V" trades fit in one one-minute window, so the answer for "V" is three.
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60. + float(item[0][3:5]) + float(item[0][6:8])/60. + float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1] - dex[item][0] < 1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1] - dex[item][t] < 1:
                    ccount += 1
                else:
                    break
            if ccount > count:
                count = ccount
            ccount = 1
    print(count)
First of all, it is not necessary to use the datetime and csv modules for such a simple case (as in Ed-Ward's example).
If we remove the colon and dot signs from the time strings, they can be converted with int() directly - an easier way than the one you tried in your example.
CSV features like dialects and special formatting are not used here, so I suggest a simple split(",").
Now about efficiency. Efficiency means time complexity.
The more times you go through your array of dates from beginning to end, the more complex the algorithm becomes.
So our goal is to minimize the number of passes: ideally make only one pass over all rows, and especially avoid nested loops that scan the collection from beginning to end.
For such a task it is better to use a deque instead of a tuple or list, because you can popleft() the first element and append the last element with O(1) complexity.
Just keep appending each trade's time to the end of its exchange's queue until the difference between the current and first elements becomes more than 1 minute; then remove the first element with popleft() and continue comparing. After the whole file is processed, the length of each queue will be the size of its largest 1-minute window.
Example with linear time complexity O(n):
from collections import deque

ex_list = {}
# strip ':' and '.' so each timestamp becomes an integer like 93001034 (HHMMSSmmm)
s = open("trades.csv").read().replace(":", "").replace(".", "")
for line in s.splitlines():
    s = line.split(",")
    curr_tm = int(s[0])
    curr_ex = s[3]
    if curr_ex not in ex_list:
        ex_list[curr_ex] = deque()
    ex_list[curr_ex].append(curr_tm)
    # if the window is now wider than one minute, drop one element from the front;
    # the deque never shrinks, so its final length is the largest window seen
    if curr_tm >= ex_list[curr_ex][0] + 100000:
        ex_list[curr_ex].popleft()
print("\n".join([str(len(ex_list[k])) for k in sorted(ex_list.keys())]))
This code should work:
import csv
import datetime

diff = datetime.timedelta(minutes=1)

def date_calc(start, dates):
    # count how many of the (sorted) timestamps fall within [start, start + 1 minute)
    for i, date in enumerate(dates):
        if date >= start + diff:
            return i
    return i + 1

exchanges = {}
with open("trades.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        this_exchange = row[3]
        if this_exchange not in exchanges:
            exchanges[this_exchange] = []
        time = datetime.datetime.strptime(row[0], "%H:%M:%S.%f")
        exchanges[this_exchange].append(time)

ex_max = {}
for name, dates in exchanges.items():
    ex_max[name] = 0
    for i, d in enumerate(dates):
        x = date_calc(d, dates[i:])
        if x > ex_max[name]:
            ex_max[name] = x

print('\n'.join([str(ex_max[k]) for k in sorted(ex_max.keys())]))
Output:
2
3
( obviously please check it for yourself before uploading it :) )
I think the issue with your current code is that you don't put the output in lexicographical order of their names...
If you want to use your current code, then here is a (hopefully) fixed version:
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
counts = []
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60. + float(item[0][3:5]) + float(item[0][6:8])/60. + float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1] - dex[item][0] < 1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1] - dex[item][t] < 1:
                    ccount += 1
                else:
                    break
            if ccount > count:
                count = ccount
            ccount = 1
    counts.append((item, count))
counts.sort(key=lambda x: x[0])
print('\n'.join([str(x[1]) for x in counts]))
Output:
2
3
I do think you can make your life easier in the future by using Python's standard library, though :)

Separating data in pandas that has a variable format

I have a txt file that is output from another modelling program, which reports parameters of one modelled node at a time. The data output is similar to the sample below. My problem is that the data comes in as a single block of columns that is occasionally broken by a new header, after which the first column (time) repeats but the second column is new. There are two things I would like to be able to do:
1) Break the data into the two columns time and data for the node. Then add the node label as the first column.
2) Later in the file there is another parameter for the node, not immediately below the first, whose header has the form Data 2 Node(XX, XX) and matches a node seen previously; I would like to pair it up with the first parameter.
This would give me 4 columns in the end with the first being the node id repeated, the second being the time, third being data parameter 1, and fourth being the matched data parameter 2.
I've included a small sample of the data below, but the output is over 1,000,000 lines, so it would be nice to use pandas or other Python functionality to manipulate the data.
Thanks for the help!
Name 20 vs 2
----------------------------------
time Data 1 Node( 72, 23)
--------------------- ---------------------
4.1203924E-003 -3.6406431E-005
1.4085015E-002 -5.8257871E-004
2.4049638E-002 6.8743013E-004
3.4014260E-002 8.2296302E-005
4.3978883E-002 -1.2276627E-004
5.3943505E-002 1.9813024E-004
....
Name 20 vs 2
----------------------------------
time Data 1 Node( 72, 24)
--------------------- ---------------------
4.1203924E-003 -3.6406431E-005
1.4085015E-002 -5.8257871E-004
2.4049638E-002 6.8743013E-004
3.4014260E-002 8.2296302E-005
4.3978883E-002 -1.2276627E-004
5.3943505E-002 1.9813024E-004
So after a fair amount of googling I managed to piece this one together. The data I was looking at was space-separated, so I used the fixed-width file read method in pandas. After that I looked at the indices of a few known elements in the list and used them to break the data up into two dataframes that I could merge and process afterwards. I removed the separator lines and NaN values, as they were not of interest to me. After that, the fun began on actually using the data.
import pandas as pd

widths = [28, 27]
df = pd.read_fwf(filename, widths=widths, names=["Times", "Items"])
data = df["Items"].astype(str)
# locate the first 'Data 1' and 'Data 2' header rows
index1 = data.str.contains('Data 1').idxmax()
index2 = data.str.contains('Data 2').idxmax()
df2 = pd.read_fwf(filename, widths=widths, skiprows=index1, skipfooter=(len(df) - index2), header=None, names=["Times", "Prin. Stress 1"])
df3 = pd.read_fwf(filename, widths=widths, skiprows=index2, header=None, names=["Times", "Prin. Stress 2"])
df2["Prin. Stress 2"] = df3["Prin. Stress 2"]
df2 = df2[~df2["Times"].str.contains("---")]  # removes the ---- separator lines
df2.dropna(inplace=True)
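A rough alternative sketch for the more general case described in the question (building Node, Time, Data 1 and Data 2 columns); the regular expression, file name and column names are assumptions based on the sample shown, not on the real file:
import re
import pandas as pd

header_re = re.compile(r'(Data \d+)\s+Node\(\s*(\d+),\s*(\d+)\)')

records = {}  # {(node, time): {'Data 1': value, 'Data 2': value}}
current_param, current_node = None, None

with open('model_output.txt') as fh:  # hypothetical file name
    for line in fh:
        m = header_re.search(line)
        if m:  # a new "time   Data N Node( x, y)" header starts a block
            current_param = m.group(1)
            current_node = '(' + m.group(2) + ',' + m.group(3) + ')'
            continue
        parts = line.split()
        if current_param and len(parts) == 2:
            try:
                t, val = float(parts[0]), float(parts[1])
            except ValueError:
                continue  # skips the ----- separator and title lines
            records.setdefault((current_node, t), {})[current_param] = val

df = pd.DataFrame(
    [(node, t, v.get('Data 1'), v.get('Data 2')) for (node, t), v in records.items()],
    columns=['Node', 'Time', 'Data 1', 'Data 2'],
)
print(df.head())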

Data Processing in Python

I'm dealing with a text data file that has 8 columns listing temperature, time, damping coefficients, etc. I need to take lines of data only where the temperature is in the range 0.320 to 0.322.
Here is a sample line of my data (there are thousands of lines):
time temp acq. freq. amplitude damping etc....
6.28444 0.32060 413.00000 117.39371 48.65073 286.00159
The only columns I care about are time, temp, and damping. I need those three values to append to my lists, but only when the temperature is in the specified range (there are some lines of my data where the temperature is all the way up at 4 kelvins, and this data is garbage).
I am using Python 3. Here are the things I have tried thus far
f = open('alldata', 'r')
c = f.readlines()
temperature = []
newtemp = []
damping = []
time = []
for line in c[0:]:
    line = line.split()
    temperature.append(line[1])
    damping.append(line[4])
    time.append(line[0])
for i in temperature:
    if float(i) > 0.320 and float(i) < 0.325:
        newtemp.append(float(i))
When I printed the list newtemp, I could see that this code correctly filled the list with temperature values only in that range. However, I also need my damping list and time list to be filled only with the values that correspond to that small temperature range, and I'm not sure how to achieve that with this code.
I have also tried this, recommended by someone here:
output = []
lines = open('alldata', 'r')
for line in lines:
    temp = line.split()
    if float(temp[1]) > 0.320 and float(temp[1]) < 0.322:
        output.append(line)
print(output)
And I get an error that says:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
I will note that I am very new to coding, so I apologize if this turns out to be a silly question.
Data:
temperature, time, coeff...
0.32, 12:00:23, 2,..
0.43, 11:22:23, 3,..
Here, temperature is in the first column.
output = []
lines = open('data.file', 'r')
for line in lines:
    temp = line.split(',')  # assumes comma-separated fields, as in the sample above
    if float(temp[0]) > 0.320 and float(temp[0]) < 0.322:
        output.append(line)
print(output)
You can use pandas module:
import pandas as pd
# if the file with the data is an excel file use:
df = pd.read_excel('data.xlsx')
# if the file is csv
df = pd.read_csv('data.csv')
# if the column name of interest is named 'temperature'
selected = df['temperature'][(df['temperature'] > 0.320) & (df['temperature'] < 0.322)]
If you do not have pandas installed see here
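Building on the pandas approach, a small sketch that also keeps the time and damping values for the selected rows; the whitespace-separated layout and the column positions (0 = time, 1 = temperature, 4 = damping) are assumptions taken from the sample line and indexing in the question:
import pandas as pd

# read the whitespace-separated file with no header row
# (if the file starts with a header line, add skiprows=1)
df = pd.read_csv('alldata', sep=r'\s+', header=None)

# keep only rows whose temperature (column 1) is in the range of interest
mask = (df[1] > 0.320) & (df[1] < 0.322)
time = df.loc[mask, 0].tolist()
newtemp = df.loc[mask, 1].tolist()
damping = df.loc[mask, 4].tolist()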

ValueError, though a check has already been performed for this

Getting a little stuck with NaN data. This program trawls through a folder on an external hard drive, loads each txt file as a dataframe, and should read the very last value of the last column. As some of the last rows do not complete, for whatever reason, I have chosen to take the row before (or that's what I hope to have done). Here is the code; I have commented the line that I think is giving the trouble:
#!/usr/bin/env python3
import glob
import math
import pandas as pd
import numpy as np

def get_avitime(vbo):
    try:
        df = pd.read_csv(vbo,
                         delim_whitespace=True,
                         header=90)
        row = next(df.iterrows())
        t = df.tail(2).avitime.values[0]
        return t
    except:
        pass

def human_time(seconds):
    secs = seconds/1000
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return '%02d:%02d:%02d' % (hours, mins, secs)

def main():
    path = 'Z:\\VBox_Backup\\**\\*.vbo'
    events = {}
    customers = {}
    for vbo_path in glob.glob(path, recursive=True):
        path_list = vbo_path.split('\\')
        event = path_list[2].upper()
        customer = path_list[3].title()
        avitime = get_avitime(vbo_path)
        if not avitime:  # this is to check there is a number
            continue
        else:
            if event not in events:
                events[event] = {customer: avitime}
                print(event)
            elif customer not in events[event]:
                events[event][last_customer] = human_time(events[event][last_customer])
                print(events[event][last_customer])
                events[event][customer] = avitime
            else:
                total_time = events[event][customer]
                total_time += avitime
                events[event][customer] = total_time
            last_customer = customer
    events[event][customer] = human_time(events[event][customer])
    df_events = pd.DataFrame(events)
    df.to_csv('event_track_times.csv')

main()
I put in a line to check for a value, but I am guessing that NaN is not a null value, hence it hasn't quite worked.
(C:\Users\rob.kinsey\AppData\Local\Continuum\Anaconda3) c:\Users\rob.kinsey\Programming>python test_single.py
BARCELONA
03:52:42
02:38:31
03:21:02
00:16:35
00:59:00
00:17:45
01:31:42
03:03:03
03:16:43
01:08:03
01:59:54
00:09:03
COTA
04:38:42
02:42:34
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
04:01:13
01:19:47
03:09:31
02:37:32
03:37:34
02:14:42
04:53:01
LAGUNA_SECA
01:09:10
01:34:31
01:49:27
03:05:34
02:39:03
01:48:14
SILVERSTONE
04:39:31
01:52:21
02:53:42
02:10:44
02:11:17
02:37:11
01:19:12
04:32:21
05:06:43
SPA
Traceback (most recent call last):
File "test_single.py", line 56, in <module>
main()
File "test_single.py", line 41, in main
events[event][last_customer] = human_time(events[event][last_customer])
File "test_single.py", line 23, in human_time
The output starts out correctly, apart from the sys:1 warning, and at least it carries on, until the final error stalls the program completely. How can I get past this NaN issue? All the variables I am working with should be of float data type or should have been ignored. All data types should be only strings or floats until the time conversion, which works in integers.
Ok, even though no one answered, I am compelled to answer my own question as I am not convinced I am the only person that has had this problem.
There are 3 main reasons for receiving NaN in a data frame. Most of these revolve around infinity, such as using 'inf' as a value, or dividing zero by zero, which also produces NaN as a result. The Wikipedia page was the most helpful for me in solving this issue:
https://en.wikipedia.org/wiki/NaN
One other important point about NaN is that it works a little like a virus, in that anything it touches in any calculation will result in NaN, so the problem can get exponentially worse. What you are actually dealing with is missing data, and until you realize that is what it is, NaN is the least useful and most frustrating thing, because it comes as a datatype rather than an error, yet any mathematical operation on it will end in NaN. Beware!
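A tiny illustration of that propagation behaviour, in plain Python (nothing here is specific to the files in question):
import math

x = float('nan')
print(x + 1)          # nan: NaN propagates through arithmetic
print(x == x)         # False: NaN is not even equal to itself
print(math.isnan(x))  # True: the reliable way to test for it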
The reason on this occasion was that a specific line number was used to pick up the headers when reading in the csv file, and although that worked for the majority of these files, some of them had the headers I was after on a different line. As a result, the headers imported into the data frame were either part of the data itself or a null value. Consequently, trying to access a column in the data frame by header name resulted in NaN, and, as discussed earlier, this proliferated through the program, causing the problems I had been using workarounds to combat. One workaround that is actually acceptable is to add this line:
df = df.fillna(0)
after the first definition of the df variable, in this case:
df = pd.read_csv(vbo,
                 delim_whitespace=True,
                 header=90)
The bottom line is that if you are receiving this value, the best thing really is to work out why you are getting NaN in the first place; then it is much easier to make an informed decision about whether replacing NaN with '0' is a viable choice.
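As a sketch of one way to attack the root cause here, finding the header row in each file rather than hard-coding header=90 (the key_column name 'avitime' comes from the code above, but the rest of the file layout is an assumption):
import pandas as pd

def read_with_detected_header(path, key_column='avitime'):
    # scan the raw lines for the one containing the expected column name,
    # then let pandas treat that line as the header
    with open(path, errors='ignore') as fh:
        for i, line in enumerate(fh):
            if key_column in line:
                return pd.read_csv(path, delim_whitespace=True, skiprows=i)
    return None  # no header found; the caller can skip this file
get_avitime() could then use this helper in place of the hard-coded pd.read_csv(..., header=90).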
I sincerely hope this helps anyone who finds it.
Regards
iFunction

Resources