Getting a little stuck with NaN data. This program trawls through a folder on an external hard drive, loads each txt file as a dataframe, and should read the very last value of the last column. As some of the last rows are incomplete for whatever reason, I have chosen to take the row before (or that's what I hope to have done). Here is the code, and I have commented the lines that I think are causing the trouble:
#!/usr/bin/env python3
import glob
import math
import pandas as pd
import numpy as np
def get_avitime(vbo):
    try:
        df = pd.read_csv(vbo,
                         delim_whitespace=True,
                         header=90)
        row = next(df.iterrows())
        t = df.tail(2).avitime.values[0]
        return t
    except:
        pass
def human_time(seconds):
    secs = seconds/1000
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return '%02d:%02d:%02d' % (hours, mins, secs)
def main():
    path = 'Z:\\VBox_Backup\\**\\*.vbo'
    events = {}
    customers = {}
    for vbo_path in glob.glob(path, recursive=True):
        path_list = vbo_path.split('\\')
        event = path_list[2].upper()
        customer = path_list[3].title()
        avitime = get_avitime(vbo_path)
        if not avitime:  # this is to check there is a number
            continue
        else:
            if event not in events:
                events[event] = {customer: avitime}
                print(event)
            elif customer not in events[event]:
                events[event][last_customer] = human_time(events[event][last_customer])
                print(events[event][last_customer])
                events[event][customer] = avitime
            else:
                total_time = events[event][customer]
                total_time += avitime
                events[event][customer] = total_time
        last_customer = customer
    events[event][customer] = human_time(events[event][customer])
    df_events = pd.DataFrame(events)
    df.to_csv('event_track_times.csv')

main()
I put in a line to check for a value, but I am guessing that NaN is not a null value, hence it hasn't quite worked.
(C:\Users\rob.kinsey\AppData\Local\Continuum\Anaconda3) c:\Users\rob.kinsey\Programming>python test_single.py
BARCELONA
03:52:42
02:38:31
03:21:02
00:16:35
00:59:00
00:17:45
01:31:42
03:03:03
03:16:43
01:08:03
01:59:54
00:09:03
COTA
04:38:42
02:42:34
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
04:01:13
01:19:47
03:09:31
02:37:32
03:37:34
02:14:42
04:53:01
LAGUNA_SECA
01:09:10
01:34:31
01:49:27
03:05:34
02:39:03
01:48:14
SILVERSTONE
04:39:31
01:52:21
02:53:42
02:10:44
02:11:17
02:37:11
01:19:12
04:32:21
05:06:43
SPA
Traceback (most recent call last):
File "test_single.py", line 56, in <module>
main()
File "test_single.py", line 41, in main
events[event][last_customer] = human_time(events[event][last_customer])
File "test_single.py", line 23, in human_time
The output starts out correctly, apart from the sys:1 warning, but at least it carries on until the final error that stalls the program completely. How can I get past this NaN issue? All the variables I am working with should be of float data type or should have been ignored. All data types should only be strings or floats until the time conversion, which uses integers.
OK, even though no one answered, I am compelled to answer my own question, as I am not convinced I am the only person who has had this problem.
There are three main reasons for receiving NaN in a dataframe. Most of them revolve around infinity, such as using 'inf' as a value or dividing zero by zero, which also produces NaN. The Wikipedia page was the most helpful for me in solving this issue:
https://en.wikipedia.org/wiki/NaN
One other important point about NaN is that it works a little like a virus, in that anything that touches it in any calculation will itself become NaN, so the problem can spread quickly. What you are actually dealing with is missing data, and until you realize that is what it is, NaN is the least useful and most frustrating thing, because it is a value rather than an error, yet any mathematical operation on it will end in NaN. BEWARE!!
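To illustrate (a minimal sketch, not from the original program): NaN survives a plain truthiness test, so a check like if not avitime: will not catch it, but numpy and pandas can detect it explicitly.

import numpy as np
import pandas as pd

x = float('nan')
print(x + 1, x * 0)   # nan nan  -> NaN spreads through every calculation it touches
print(bool(x))        # True     -> so "if not x:" does NOT filter out a NaN value
print(np.isnan(x))    # True     -> explicit checks do work
print(pd.isna(x))     # True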
The reason on this occasion is that a specific line number was used to pick up the headers when reading in the csv file. Although that worked for the majority of these files, some of them had the headers I was after on a different line, so the headers imported into the dataframe were either part of the data itself or a null value. As a result, trying to access a column in the dataframe by header name produced NaN, and as discussed above, this proliferated through the program, causing a few problems which I had used workarounds to combat. One of those workarounds is actually acceptable, which is to add this line:
df = df.fillna(0)
after the first definition of the df variable, in this case:
df = pd.read_csv(vbo,
                 delim_whitespace=True,
                 header=90)
The bottom line is that if you are receiving this value, the best thing really is to work out why you are getting NaN in the first place; then it is easier to make an informed decision about whether replacing NaN with 0 is a viable choice.
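For example, one way to work out the "why" here is to locate the real header line instead of hard-coding header=90. Below is a minimal sketch, assuming the header row can be recognised by a known column name such as avitime and that no blank lines precede it:

def find_header_row(path, marker='avitime'):
    # Scan the file for the line that actually holds the column names.
    with open(path) as f:
        for i, line in enumerate(f):
            if marker in line:
                return i
    return None  # header not found; caller should skip or flag this file

header_row = find_header_row(vbo)
if header_row is not None:
    df = pd.read_csv(vbo, delim_whitespace=True, header=header_row)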
I sincerely hope this helps anyone who finds it.
Regards
iFunction
Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none get into the continual looping through a changed dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail - the warning was still thrown.
Example code:
import pandas as pd
a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)
z = [x/(x+2) for x in range(1,5)]
print(z)
# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified.
    # What is below this is already modified and locked into the geological record.
    # What is above has not yet been deposited.
    b = a.iloc[ind:(ind+len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model mixing
    # below the boundary layer (key error).
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind+loop), :] = b
    # Try df.copy(), but still throws warning!
    # a.iloc[ind:(ind+loop), :] = b.copy()

print(a)
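For reference, a minimal sketch of one alternative (my rewording of the example above, assuming the default RangeIndex it uses): skip the intermediate slice b and write each updated value straight into the original frame with .at, so there is no chained assignment on a possible copy and no warning is raised.

# Same update as the loops above, but written directly into `a` with .at
for ind in range(len(a)):
    loop = min(len(z), len(a) - ind)
    for fraction in range(loop):
        # single-step assignment into the original frame; no intermediate slice
        a.at[ind + fraction, 'a'] = a.at[ind + fraction, 'a'] * z[fraction]
print(a)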
This one has been bothering me for a while. I have all the pieces (I think) that work individually to create the output I'm looking for (calculate a profit and loss for a stock), but when put together they return nothing.
The dataframe itself is pretty self-explanatory so I haven't included an example. Basically the series includes Stock Symbol, Opening Time, Opening Price, Closing Time, Closing Price, and whether or not it was a long or short position.
Here's my code to calculate the P-L for a long position:
import pandas as pd
from yahoo_fin import stock_info as si
from datetime import datetime, timedelta, date
import time
def create_df3():
    return pd.read_excel('Base_Sheet.xlsx', sheet_name="Closed_Pos", header=0)

def update_price(sym):
    return si.get_live_price(sym)

long_pl_calc = ((df3['Close_Price']) / (df3['Entry_Price'])) - 1
close_long_pl = df3['P-L'].isnull and (df3['Long_Short'] == 'Long')

for row in df3.iterrows():
    if close_long_pl is True:
        return df3['P-L'].apply(long_pl_calc)
If I print long_pl_calc or close_long_pl, I get exactly what I expect. However, when I iterate through the series to return the calculation, I still end up with a 'NaN' value (but not an error).
Any help would be appreciated! I already know the solution I came to is terrible, but I've also tried at least a dozen other iterations with no success either.
Create a column df3['Long'] with 1 for the dates you are long and 0 for the rest. Then, to get your long P&L (you can do the same for the short side, but don't forget to take the opposite sign of the return), you can do:
df3['P&L Long'] = ((df3['Close_Price'] / df3['Entry_Price']) - 1) * df3['Long']
Then for your df3['P-L'] it will be:
df3['P-L'] = df3['P&L Long'] + df3['P&L Short']
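Putting that together, a minimal end-to-end sketch (the column names Close_Price, Entry_Price and Long_Short come from the question; the rest is my assumption):

import pandas as pd

df3 = pd.read_excel('Base_Sheet.xlsx', sheet_name="Closed_Pos", header=0)

# 1 where the position is long, 1 where it is short, 0 otherwise
df3['Long'] = (df3['Long_Short'] == 'Long').astype(int)
df3['Short'] = (df3['Long_Short'] == 'Short').astype(int)

ret = (df3['Close_Price'] / df3['Entry_Price']) - 1
df3['P&L Long'] = ret * df3['Long']
df3['P&L Short'] = -ret * df3['Short']        # opposite sign for short positions
df3['P-L'] = df3['P&L Long'] + df3['P&L Short']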
I'm attempting to merge two large dataframes (one with 50k+ values, and another with 650k+ values, pared down from 7M+). Merging/matching is being done via fuzzywuzzy, to find which string in the first dataframe matches which string in the other most closely.
At the moment, it takes about 3 minutes to test 100 rows for variables. Consequently, I'm attempting to institute Dask to help with the processing speed. In doing so, Dask returns the following error: 'NotImplementedError: Series getitem in only supported for other series objects with matching partition structure'
Presumably, the error is due to my dataframes not being of equal size. When trying to set a chunksize while converting my pandas dataframes to dask dataframes, I receive an error (TypeError: 'float' object cannot be interpreted as an integer), even though I previously forced all my datatypes in each dataframe to 'objects'. Consequently, I was forced to use the npartitions parameter in the dataframe conversion, which then leads to the 'NotImplementedError' above.
I've tried to standardize the chunksize across partitions with a mathematical index, and also tried using the npartitions parameter, to no effect; both result in the same NotImplementedError.
As mentioned, my efforts to do this without Dask have been successful, but far too slow to be useful.
I've also taken a look at these questions/responses:
- Different error
- No solution presented
- Seems promising, but results are still slow
prices_filtered_ddf = dd.from_pandas(prices_filtered, chunksize=25e6)  # prices_filtered: 404.2MB
all_data_ddf = dd.from_pandas(all_data, chunksize=25e6)                # all_data: 88.7MB

# import dask
client = Client()
dask.config.set(scheduler='processes')

# Define matching function
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the other
    for name2 in list_names:
        # Finding fuzzy match score
        score = fuzz.token_set_ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)

# List for dicts for easy dataframe creation
dict_list = []
# iterating over our players without salaries found above
for name in prices_filtered_ddf['ndc_description'][:100]:
    # Use our method to find best match, we can set a threshold here
    match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description_agg': name})
    dict_.update({'ndc_description': match[0]})
    dict_.update({'score': match[1]})
    dict_list.append(dict_)

merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
Here's the full error:
NotImplementedError Traceback (most recent call last)
<ipython-input-219-e8d4dcb63d89> in <module>
3 dict_list = []
4 # iterating over our players without salaries found above
----> 5 for name in prices_filtered_ddf['ndc_description'][:100]:
6 # Use our method to find best match, we can set a threshold here
7 match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
C:\Anaconda\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
2671 return Series(graph, name, self._meta, self.divisions)
2672 raise NotImplementedError(
-> 2673 "Series getitem in only supported for other series objects "
2674 "with matching partition structure"
2675 )
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
I expect that the merge_table will return, in a relatively short time, a dataframe with data for each of the update columns. At the moment, it's extremely slow.
I'm afraid there are a number of problems with this question, so after pointing these out, I can only provide some general guidance.
- The traceback shown is clearly not produced by the code above
- The indentation and syntax are broken
- A distributed client is made, then config set not to use it ("processes" is not the distributed scheduler)
- The client object is called, client(...), but it is not callable; this shouldn't work at all
- The main processing function, match_name, is called directly; how do you expect Dask to intervene?
- You don't ever call compute(), so in the code given, I'm not sure Dask is invoked at all.
What you actually want to do:
- Load your smaller, reference dataframe using pandas, and call client.scatter to make sure all the workers have it
- Load your main data with dd.read_csv
- Call df.map_partitions(..) to process your data, where the function you pass should take two pandas dataframes and work row-by-row; a sketch of this recipe follows below.
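A minimal sketch of that recipe (file names are assumptions, fuzz comes from fuzzywuzzy as in the question, and the client.scatter step from the first point is omitted here for brevity):

import pandas as pd
import dask.dataframe as dd
from fuzzywuzzy import fuzz

def best_match(name, choices, min_score=0):
    # Return (closest fuzzy match, score) for `name` among `choices`.
    best_name, best_score = "", -1
    for candidate in choices:
        score = fuzz.token_set_ratio(name, candidate)
        if score > min_score and score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score

def match_partition(part, ref_names):
    # Runs on each partition as a plain pandas DataFrame, row by row.
    part = part.copy()
    results = [best_match(n, ref_names, min_score=80) for n in part['ndc_description']]
    part['ndc_description_match'] = [r[0] for r in results]
    part['match_score'] = [r[1] for r in results]
    return part

# small reference frame loaded with pandas (file name assumed)
all_data = pd.read_csv('all_data.csv')
ref_names = all_data['ndc_description_agg'].tolist()

# large frame loaded lazily with dask (file name assumed)
prices_ddf = dd.read_csv('prices_filtered.csv')

# each partition is processed independently; compute() triggers the actual work
merge_table = prices_ddf.map_partitions(match_partition, ref_names).compute()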
Encountered an error today involving importing a CSV file with dates. The file has known quality issues and in this case one entry was "3/30/3013" due to a data entry error.
From reading other entries about the OutOfBoundsDatetime error, I learned that pandas' datetime upper limit maxes out at 4/11/2262. The suggested solution was to fix the formatting of the dates. In my case the date format is correct, but the data is wrong.
Applying numpy logic:
df['Contract_Signed_Date'] = np.where(df['Contract_Signed_Date'] > '12/16/2017',
                                      df['Alt_Date'], df['Contract_Signed_Date'])
Essentially, if the file's 'Contract Signed Date' is greater than today (being 12/16/2017), I want to use the Alt_Date column instead. It seems to work, except that when it hits the year 3013 entry it errors out. What's a good pythonic way around the out-of-bounds error?
Perhaps hideously unpythonic but it appears to do what you want.
Input, file arthur.csv:
input_date,var1,var2
3/30/3013,2,34
02/2/2017,17,35
Code:
import pandas as pd
from io import StringIO

target_date = '2017-12-17'

for_pandas = StringIO()
print('input_date,var1,var2,alt_date', file=for_pandas)  # new header

with open('arthur.csv') as arthur:
    next(arthur)  # skip header in csv
    for line in arthur:
        line_items = line.rstrip().split(',')
        date = '{:4s}-{:0>2s}-{:0>2s}'.format(*list(reversed(line_items[0].split('/'))))
        if date > target_date:
            output = '{},{},{},{}'.format(*['NaT', line_items[1], line_items[2], date])
        else:
            output = '{},{},{},{}'.format(*[date, line_items[1], line_items[2], 'NaT'])
        print(output, file=for_pandas)

for_pandas.seek(0)
df = pd.read_csv(for_pandas, parse_dates=['input_date', 'alt_date'])
print(df)
Output:
  input_date  var1  var2    alt_date
0        NaT     2    34  3013-30-03
1 2017-02-02    17    35         NaT
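For comparison, a shorter sketch (column names are taken from the question, the file name is assumed) that lets pd.to_datetime turn out-of-range or malformed dates into NaT with errors='coerce' and then falls back to Alt_Date:

import pandas as pd

df = pd.read_csv('contracts.csv')  # file name assumed

signed = pd.to_datetime(df['Contract_Signed_Date'], errors='coerce')  # 3/30/3013 becomes NaT
alt = pd.to_datetime(df['Alt_Date'], errors='coerce')
cutoff = pd.Timestamp('2017-12-16')

# use Alt_Date wherever the signed date is missing or later than today
df['Contract_Signed_Date'] = signed.where(signed.notna() & (signed <= cutoff), alt)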
I am (extremely) new to coding and trying to automate some processes for manipulating data as part of my PhD.
I have a CSV file from a heart rate monitor with time stored as MM:SS.s and heart rate at that time. e.g.
Time, Heart_rate
00:00.6, 100
00:01.0, 102
00:01.5, 102
I've used the CSV package to import and DictReader to get the data into an array.
import csv
with open('hr_data.csv', 'rU') as infile:
    reader = csv.DictReader(infile, delimiter=',')
The data comes in as strings, so I have used the following code to first convert heart rate to a number, and then convert time (e.g. 00:05.5 for 5.5 seconds) to a float of seconds.
sec = 0
for row in reader:
    row['Heart_rate'] = int(row['Heart_rate'])
    temp = row.get('Time')
    sec = (float(temp[3:7]) + (float(temp[0:2]) * 60))
    row['Time'] = sec
This seems to work if I print(row) afterward (everything is a float and time is in units of seconds). However, when I then move on to try to bin the data into 10-second bins, everything has reverted back to the original strings and I can't even do:
for row in reader:
    print(row)
as this just prints nothing...
Thanks in advance.
do
with open('hr_data.csv', 'rU') as infile:
    reader = list(csv.DictReader(infile, delimiter=','))
and it will work like you want.
csv.DictReader is a generator. It goes through each line one by one, and when it's done, it's done. Because you went through all the lines in the first for loop, it had already read everything from the file and was empty by the time you looped again.
To save a generator's results (i.e. store all the lines), convert the generator to a list.
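A minimal sketch of how that fits with the original goal (the 10-second binning at the end is an assumed next step, not something from the question):

import csv
from collections import defaultdict

with open('hr_data.csv', newline='') as infile:
    # list(...) caches every row, so the data can be looped over as often as needed
    rows = list(csv.DictReader(infile, skipinitialspace=True))

# convert types in place, much as in the question (split(':') instead of slicing)
for row in rows:
    row['Heart_rate'] = int(row['Heart_rate'])
    mins, secs = row['Time'].split(':')
    row['Time'] = float(secs) + float(mins) * 60

# assumed next step: average heart rate per 10-second bin
bins = defaultdict(list)
for row in rows:
    bins[int(row['Time'] // 10)].append(row['Heart_rate'])

for b in sorted(bins):
    rates = bins[b]
    print('%3d-%3d s: mean HR %.1f' % (b * 10, b * 10 + 10, sum(rates) / len(rates)))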
Yes, your calculations seem correct. However, once you leave the environs of the with construct, many items are lost. For instance, reader is built to give you the header here; it goes away. Since it departs, so do the values for row that come from it. You need to arrange to save what you want as you go through the loop.
>>> import csv
>>> times = []
>>> heart_rates = []
>>> with open('heart.csv') as heart:
... reader = csv.DictReader(heart, skipinitialspace=True)
... for row in reader:
... temp = row['Time']
... times.append(float(temp[3:7]) + (float(temp[0:2]) * 60))
... heart_rates.append(int(row['Heart_rate']))
...
>>> times
[0.6, 1.0, 1.5]
>>> heart_rates
[100, 102, 102]
Correction: upon discussing this with ddg I've learned that reader does persist outside the with block. Unfortunately, though, I haven't been able to re-read the rows in reader with a second for row in reader loop, because outside the with block the file heart has been closed.
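For completeness, a small sketch (my addition, assuming you want two passes over the data) showing that you can re-read inside the with block by seeking back to the start and building a fresh reader; once the block exits, the file is closed and the reader can no longer pull rows from it:

import csv

with open('heart.csv', newline='') as heart:
    reader = csv.DictReader(heart, skipinitialspace=True)
    first_pass = [row['Time'] for row in reader]            # reader is now exhausted

    heart.seek(0)                                           # rewind the still-open file
    reader = csv.DictReader(heart, skipinitialspace=True)   # fresh reader picks up the header again
    second_pass = [row['Heart_rate'] for row in reader]

print(first_pass, second_pass)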