Encountered an error today involving importing a CSV file with dates. The file has known quality issues and in this case one entry was "3/30/3013" due to a data entry error.
Reading other entries about the OutOfBoundsDatetime error, datetime's upper limit maxes out at 4/11/2262. The suggested solution was to fix the formatting of the dates. In my case the date format is correct but the data is wrong.
Applying numpy logic:
df['Contract_Signed_Date'] = np.where(df['Contract_Signed_Date']>'12/16/2017',
df['Alt_Date'],df['Contract_Signed_Date'])
Essentially, if the file's 'Contract Signed Date' is greater than today (12/16/2017), I want to use the Alt_Date column instead. It seems to work, except that when it hits the year 3013 entry it errors out. What's a good pythonic way around the out of bounds error?
Perhaps hideously unpythonic but it appears to do what you want.
Input, file arthur.csv:
input_date,var1,var2
3/30/3013,2,34
02/2/2017,17,35
Code:
import pandas as pd
from io import StringIO

target_date = '2017-12-17'
for_pandas = StringIO()
print('input_date,var1,var2,alt_date', file=for_pandas)  # new header
with open('arthur.csv') as arthur:
    next(arthur)  # skip header in csv
    for line in arthur:
        line_items = line.rstrip().split(',')
        # rebuild M/D/YYYY as zero-padded YYYY-MM-DD so that string comparison orders dates correctly
        month, day, year = line_items[0].split('/')
        date = '{:4s}-{:0>2s}-{:0>2s}'.format(year, month, day)
        if date > target_date:
            # date is after the target: blank input_date and keep the value in alt_date
            output = '{},{},{},{}'.format('NaT', line_items[1], line_items[2], date)
        else:
            output = '{},{},{},{}'.format(date, line_items[1], line_items[2], 'NaT')
        print(output, file=for_pandas)
for_pandas.seek(0)
df = pd.read_csv(for_pandas, parse_dates=['input_date', 'alt_date'])
print(df)
Output:
0 NaT 2 34 3013-03-30
1 2017-02-02 17 35 NaT
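A possible alternative, sketched here with the standard library only: Python's own datetime accepts years up to 9999, so the 3013 entry can be parsed and compared without hitting the pandas Timestamp limit. The helper names (parse_mdy, pick_date) and the sample rows are hypothetical, and the M/D/YYYY format is assumed from the question.

```python
from datetime import datetime

CUTOFF = datetime(2017, 12, 16)  # "today" in the question


def parse_mdy(text):
    """Parse an M/D/YYYY string; return None if it is not a valid date."""
    try:
        return datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return None


def pick_date(signed_text, alt_text, cutoff=CUTOFF):
    """Use the signed date unless it is missing or later than the cutoff."""
    signed = parse_mdy(signed_text)
    if signed is None or signed > cutoff:
        return parse_mdy(alt_text)
    return signed


# Hypothetical (Contract_Signed_Date, Alt_Date) pairs
rows = [("3/30/3013", "3/30/2013"), ("02/2/2017", "01/15/2017")]
chosen = [pick_date(signed, alt) for signed, alt in rows]
print(chosen)
```

The cleaned values all fall within the range pandas can represent, so they could then be loaded into a DataFrame column.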
I have a text file (Player_hits.txt) that I am trying to pull player batting averages from. Similar to lines 179-189, I want to find an average. However, I do not want the average for the entire team. Instead, I want the average of every individual player on the team.
For instance, the text file is set up as such:
Player_hits.txt
In this file a 1 defines a hit and a 0 means the player did not get a hit. I am trying to pull an individual average for both players. (Alex = 0.500, Riley = 0.666)
If someone could help, that would be greatly appreciated!
Thanks!
Link to original code on repl.it: Baseball Stat-Tracking
The json.decoder.JSONDecodeError occurs because json.loads() doesn't interpret each line (e.g. "[1, 'Riley']\n") as valid JSON, since JSON strings must be double-quoted. You can use ast to read each line in as a literal evaluation instead, storing it as a list element [1, 'Riley'] in your list p_hits.
The second part is that you can convert the list to a DataFrame and group by the 'name' column. So jim has the right idea, but there are errors in that too (i.e. colmuns should be columns, and the items in the list need to be the strings ['hit', 'name'], not undeclared variables).
import pandas as pd
import ast

p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        # literal_eval safely parses a Python literal such as [1, 'Riley']
        l = ast.literal_eval(line)
        p_hits.append(l)
df = pd.DataFrame(p_hits, columns=['hit', 'name'])
Output: with an example dataset I made
print(df.groupby(['name']).mean())
hit
name
Matt 0.714286
Riley 0.285714
Todd 0.500000
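For comparison, the same per-player average can be sketched without pandas. This assumes each line of the file looks like [1, 'Riley'] (hit flag first, then name); the sample lines below are made up to match the averages quoted in the question, and batting_averages is a hypothetical helper name.

```python
import ast
from collections import defaultdict


def batting_averages(lines):
    """Return {player: hits / at_bats} from lines such as "[1, 'Riley']"."""
    hits = defaultdict(int)
    at_bats = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        hit, name = ast.literal_eval(line)
        hits[name] += hit
        at_bats[name] += 1
    return {name: hits[name] / at_bats[name] for name in at_bats}


# Hypothetical file contents
sample = ["[1, 'Alex']\n", "[0, 'Alex']\n",
          "[1, 'Riley']\n", "[0, 'Riley']\n", "[1, 'Riley']\n"]
averages = batting_averages(sample)
for name, avg in averages.items():
    print(name, round(avg, 3))
```

To read from the real file, pass the open file object instead of the sample list.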
import pandas as pd
import json

p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = json.loads(line)
        p_hits.append(l)
df = pd.DataFrame.from_records(p_hits, colmuns=[hit, name])
df.groupby(['name']).mean()
I'm having problems reading in the following three seasons of data (all the seasons after these ones load without problem).
import pandas as pd
import itertools
alphabets = ['a','b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 3)]
col_names = keywords[:57]
seasons = [2002, 2003, 2004]
for season in seasons:
    df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names).dropna(how='all')
This gives the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 57 fields in line 337, saw 62
I have looked on Stack Overflow for problems that have a similar error code (see below), but none seem to offer a solution that fits my problem.
Python Pandas Error tokenizing data
I'm pretty sure the error is caused when there is missing data in the last column, however I don't know how to fix it, can someone please explain how to do this?
Thanks
Baz
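One way to see which lines have more fields than expected is to count fields per row with the standard csv module before handing the file to pandas. This is only a sketch: field_count_report is a hypothetical helper, and the sample text is made up (the real files would be downloaded first).

```python
import csv
import io
from collections import Counter


def field_count_report(text):
    """Return (Counter of field counts, [(line_no, count)] for rows wider than the header)."""
    rows = list(csv.reader(io.StringIO(text)))
    expected = len(rows[0])  # the header row defines the expected width
    counts = Counter(len(row) for row in rows)
    too_wide = [(line_no, len(row))
                for line_no, row in enumerate(rows, start=1)
                if len(row) > expected]
    return counts, too_wide


# Hypothetical data: the last row carries two extra trailing commas
sample = "Div,Date,HomeTeam\nE0,17/08/02,Arsenal\nE0,17/08/02,Leeds,,\n"
counts, too_wide = field_count_report(sample)
print(counts)
print(too_wide)
```

The largest count found this way is the number of column names that would be needed for every row to fit.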
UPDATE:
The amended code now works for seasons 2002 and 2003. However 2004 is now producing a new error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Following the answer below from Serge Ballesta option 2:
UnicodeDecodeError when reading CSV file in Pandas with Python
df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names, encoding = "latin1").dropna(how='all')
With the above amendment the code also works for season=2004.
I still have two questions though:
Q1.) How can I find which character(s) were causing the problem in season 2004?
Q2.) Is it safe to use the 'latin1' encoding for every season even though they were originally encoded as 'utf-8'?
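For Q1, one approach is to take the raw bytes and ask the decoder where it fails. The sketch below uses made-up bytes in place of the downloaded file, and undecodable_bytes is a hypothetical helper. For Q2, note that latin1 maps every possible byte to a character, so decoding with it never raises; the trade-off is that any genuinely multi-byte UTF-8 text would come out as the wrong characters.

```python
def undecodable_bytes(raw, encoding="utf-8"):
    """Return a list of (offset, byte_value) for bytes that fail to decode."""
    bad = []
    start = 0
    while start < len(raw):
        try:
            raw[start:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            offset = start + err.start  # err.start is relative to the slice
            bad.append((offset, raw[offset]))
            start = offset + 1  # skip the offending byte and keep looking
    return bad


# Hypothetical content: a stray 0xa0 byte at position 0, as in the error message
sample = b"\xa0Div,Date\nE0,17/08/04\n"
problems = undecodable_bytes(sample)
print([(offset, hex(value)) for offset, value in problems])

# latin1 decodes the same bytes without error; 0xa0 becomes a non-breaking space
as_latin1 = sample.decode("latin1")
print(repr(as_latin1[0]))
```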
This is my first post on Stack Overflow and I'm pretty new to programming, especially Python. I'm in engineering and am learning Python to complement that going forward, mostly for math and graphing applications.
Basically my question is how do I download csv excel data off a source (in my case stock data from google), and plot only certain rows against the date. For myself I want the date against the close value.
Right now the error message I'm getting is: time data '5-Jul-17' does not match format '%d-%m-%Y'
Previously I was also getting "tuple data does not match".
The opened CSV data in Excel has 7 columns (Date, Open, High, Low, Close, AdjClose, Volume), and the date is organized as 2017-05-30.
I'm sure there are other errors as well unfortunately
I would really be grateful for any help on this,
thank you in advance!
--edit--
Upon fiddling some more, I don't think names and dtypes are necessary; when I check the matrix dimensions without those identifiers I get (250L, 6L), which seems right. Now my main problem is converting the dates to something usable. My error now is that strptime only accepts strings, so I'm not sure what to use (see updated code below).
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

def graph_data(stock):
    # getting the data off google finance
    data = np.genfromtxt('urlgoeshere'+stock+'forthecsvdata', delimiter=',',
                         skip_header=1)
    # checking format of matrix
    print(data.shape)  # returns (250L, 6L)
    time_format = '%d-%m-%Y'
    # I only want the 1st column (dates) and 5th column (close), all rows
    date = data[:,0][:,]
    close = data[:,4][:,]
    dates = [datetime.strptime(date, time_format)]
    # plotting section
    plt.plot_date(dates, close, '-')
    plt.legend()
    plt.show()

graph_data('stockhere')
Assuming the dates in the csv file are in the format '5-Jul-17', the proper format string to use is %d-%b-%y.
In [6]: datetime.strptime('5-Jul-17','%d-%m-%Y')
ValueError: time data '5-Jul-17' does not match format '%d-%m-%Y'
In [7]: datetime.strptime('5-Jul-17','%d-%b-%y')
Out[7]: datetime.datetime(2017, 7, 5, 0, 0)
See the Python documentation on strptime() behavior.
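Applied to a whole column, the same format string can be used in a list comprehension. This is a minimal sketch with made-up date strings; note that %b matches month abbreviations in the current locale, so English names are assumed here.

```python
from datetime import datetime

# Hypothetical date strings in the same layout as the CSV
raw_dates = ["5-Jul-17", "6-Jul-17", "30-May-17"]

# %d = day of month, %b = abbreviated month name, %y = two-digit year
dates = [datetime.strptime(text, "%d-%b-%y") for text in raw_dates]
print(dates)
```

The resulting list of datetime objects is the kind of input matplotlib's date plotting expects.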
Getting a little stuck with NaN data. This program trawls through a folder on an external hard drive, loads each txt file in as a dataframe, and should read the very last value of the last column. As some of the last rows are incomplete for whatever reason, I have chosen to take the row before (or that's what I hope to have done). Here is the code, and I have commented the lines that I think are giving the trouble:
#!/usr/bin/env python3
import glob
import math
import pandas as pd
import numpy as np

def get_avitime(vbo):
    try:
        df = pd.read_csv(vbo,
                         delim_whitespace=True,
                         header=90)
        row = next(df.iterrows())
        t = df.tail(2).avitime.values[0]
        return t
    except:
        pass

def human_time(seconds):
    secs = seconds/1000
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return '%02d:%02d:%02d' % (hours, mins, secs)

def main():
    path = 'Z:\\VBox_Backup\\**\\*.vbo'
    events = {}
    customers = {}
    for vbo_path in glob.glob(path, recursive=True):
        path_list = vbo_path.split('\\')
        event = path_list[2].upper()
        customer = path_list[3].title()
        avitime = get_avitime(vbo_path)
        if not avitime: # this is to check there is a number
            continue
        else:
            if event not in events:
                events[event] = {customer:avitime}
                print(event)
            elif customer not in events[event]:
                events[event][last_customer] = human_time(events[event][last_customer])
                print(events[event][last_customer])
                events[event][customer] = avitime
            else:
                total_time = events[event][customer]
                total_time += avitime
                events[event][customer] = total_time
            last_customer = customer
    events[event][customer] = human_time(events[event][customer])
    df_events = pd.DataFrame(events)
    df_events.to_csv('event_track_times.csv')

main()
I put in a line to check for a value, but I am guessing that NaN is not a null value, hence it hasn't quite worked.
(C:\Users\rob.kinsey\AppData\Local\Continuum\Anaconda3) c:\Users\rob.kinsey\Programming>python test_single.py
BARCELONA
03:52:42
02:38:31
03:21:02
00:16:35
00:59:00
00:17:45
01:31:42
03:03:03
03:16:43
01:08:03
01:59:54
00:09:03
COTA
04:38:42
02:42:34
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
04:01:13
01:19:47
03:09:31
02:37:32
03:37:34
02:14:42
04:53:01
LAGUNA_SECA
01:09:10
01:34:31
01:49:27
03:05:34
02:39:03
01:48:14
SILVERSTONE
04:39:31
01:52:21
02:53:42
02:10:44
02:11:17
02:37:11
01:19:12
04:32:21
05:06:43
SPA
Traceback (most recent call last):
File "test_single.py", line 56, in <module>
main()
File "test_single.py", line 41, in main
events[event][last_customer] = human_time(events[event][last_customer])
File "test_single.py", line 23, in human_time
The output starts out correctly, apart from the sys:1 warning (the program at least carries on past that), but the final error stalls the program completely. How can I get past this NaN issue? All the variables I am working with should be floats or should have been ignored. All data types should only be strings or floats until the time conversion, which produces integers.
Ok, even though no one answered, I am compelled to answer my own question as I am not convinced I am the only person that has had this problem.
There are 3 main reasons for receiving NaN in a data frame. Most of these revolve around infinity or division by zero, for example computing inf - inf or 0/0, both of which give NaN as a result. The wiki page was the most helpful for me in solving this issue:
https://en.wikipedia.org/wiki/NaN
One other important point about NaN is that it works a little like a virus, in that anything that touches it in any calculation will result in NaN, so the problem can get exponentially worse. What you are actually dealing with is missing data, and until you realize that, NaN is a frustrating and unhelpful thing: it is a value of a data type rather than an error, yet any mathematical operation on it will end in NaN. BEWARE!!
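A short sketch of that behaviour using plain Python floats. It also shows why a check such as `if not avitime:` lets NaN through: NaN is truthy. The is_missing helper is a hypothetical name, not something from the original program.

```python
import math

nan = float("nan")

print(bool(nan))              # True: NaN is truthy, so `if not value:` does not skip it
print(nan == nan)             # False: NaN never compares equal, even to itself
print(math.isnan(nan + 1.0))  # True: arithmetic involving NaN stays NaN


def is_missing(value):
    """Treat both None and NaN as 'no usable number'."""
    return value is None or (isinstance(value, float) and math.isnan(value))


print(is_missing(nan), is_missing(None), is_missing(12.5))
```

An explicit check like this separates "no value was read" from a genuine reading of 0.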
The reason on this occasion is that a specific line was used to get the headers when reading in the csv file. Although that worked for the majority of these files, some of them had the headers I was after on a different line, so the headers being imported into the data frame were either part of the data itself or a null value. As a result, trying to access a column in the data frame by header name resulted in NaN and, as discussed earlier, this proliferated through the program, causing a few problems that I had used workarounds to combat. One of those workarounds was actually acceptable, which is to add this line:
df = df.fillna(0)
after the first definition of the df variable, in this case:
df= pd.read_csv(vbo,
delim_whitespace=True,
header=90)
The bottom line is that if you are receiving this value, the best thing really is to work out why you are getting NaN in the first place, then it is easier to make an informed decision as to whether or not replacing NaN with '0' is a viable choice.
I sincerely hope this helps anyone who finds it.
Regards
iFunction
How do I find out which person stayed the maximum number of nights? I want the name and the total number of days (date format MM/DD).
For example,
the text file contains
Robin 01/11 01/15
Mike 02/10 02/12
John 01/15 02/15
output expected
('john', 30 )
my code
from datetime import datetime

def longest_stay(fpath):
    with open(fpath,'r') as f_handle:
        stay=[]
        for line in f_handle:
            name, a_date, d_date = line.strip().split()
            diff = datetime.strptime(d_date, "%m/%d") - datetime.strptime(a_date, "%m/%d")
            stay.append(abs(diff.days+1))
        return name,max(stay)
It always returns the first name.
This can also be implemented using pandas, which I think makes it much simpler.
One issue is how you want to handle the case where several people stayed for the maximum number of nights. I have addressed that in the following code.
import pandas as pd

def longest_stay(fpath):
    # Read the text file as a DataFrame (fpath is the folder prefix of test.txt)
    data = pd.read_csv(fpath + 'test.txt', sep=" ", header=None)
    # Add column names to the DataFrame
    data.columns = ['Name', 'a_date', 'd_date']
    # Calculate the nights for each customer (MM/DD dates, same year assumed)
    arrive = pd.to_datetime(data['a_date'], format="%m/%d")
    depart = pd.to_datetime(data['d_date'], format="%m/%d")
    data['nights'] = (depart - arrive).dt.days
    # Select the rows with the maximum number of nights, keeping name and nights
    longest = data.loc[data.nights == data.nights.max(), ['Name', 'nights']]
    # In case several people stayed the longest, return a list of (name, nights) tuples
    return list(longest.itertuples(index=False, name=None))
Your code fails because it does not store the name alongside the stay: you only store the days as you go, so name is always set to the last name in the file, hence you always see the last name.
You also add + 1, which does not seem correct: you should not include the last day, as the person does not stay that night. Your code would actually output ('John', 32): the correct name by chance, because it is the last in your sample file, and the day count off by 1.
Just keep track of the best, which includes the name and day count, as you go, using the days stayed as the measure, and return that at the end:
from datetime import datetime
from csv import reader

def longest_stay(fpath):
    with open(fpath, 'r') as f_handle:
        mx, best = None, None
        for name, a_date, d_date in reader(f_handle, delimiter=" "):
            days = (datetime.strptime(d_date, "%m/%d") - datetime.strptime(a_date, "%m/%d")).days
            # first iteration or we found a longer stay
            if best is None or mx < days:
                # remember the new maximum along with the name
                mx, best = days, (name, days)
    return best
Output:
In [13]: cat test.txt
Robin 01/11 01/15
Mike 02/10 02/12
John 01/15 02/15
In [14]: longest_stay("test.txt")
# 31 days not including the last day as a stay
Out[14]: ('John', 31)
You only need to use abs if the format is not always start-end, but be aware that you could get the wrong output using the abs value if your dates had years.