Pandas read_csv not working when items missing from last column - python-3.x

I'm having problems reading in the following three seasons of data (all the seasons after these ones load without problem).
import pandas as pd
import itertools
alphabets = ['a','b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 3)]
col_names = keywords[:57]
seasons = [2002, 2003, 2004]
for season in seasons:
    df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names).dropna(how='all')
This gives the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 57 fields in line 337, saw 62
I have looked on Stack Overflow for questions with a similar error (see below), but none seem to offer a solution that fits my problem.
Python Pandas Error tokenizing data
I'm pretty sure the error is caused by missing data in the last column, but I don't know how to fix it. Can someone please explain how to do this?
Thanks
Baz
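A common workaround for this ParserError, and possibly the amendment the update below refers to, is to supply enough column names to cover the widest row the error reports; a minimal sketch for a single season:

import itertools
import pandas as pd

alphabets = ['a', 'b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat=3)]
col_names = keywords[:62]  # 62 names, enough for the widest row in the error message

url = "https://www.football-data.co.uk/mmz4281/0203/E0.csv"  # 2002/03 season
df = pd.read_csv(url, names=col_names).dropna(how='all')

When names is longer than a given row, pandas fills the missing trailing fields with NaN instead of raising.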
UPDATE:
The amended code now works for seasons 2002 and 2003. However, 2004 is now producing a new error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Following option 2 from Serge Ballesta's answer below:
UnicodeDecodeError when reading CSV file in Pandas with Python
df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names, encoding = "latin1").dropna(how='all')
With the above amendment the code also works for season=2004.
I still have two questions though:
Q1.) How can I find which character(s) were causing the problem in season 2004?
Q2.) Is it safe to use the 'latin1' encoding for every season even though they were originally encoded as 'utf-8'?
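Regarding Q1, a minimal sketch (assuming the 2004/05 URL built by the code above) that locates the first byte that is not valid UTF-8:

import urllib.request

url = "https://www.football-data.co.uk/mmz4281/0405/E0.csv"
raw = urllib.request.urlopen(url).read()
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte (0xa0 in the error above)
    print("bad byte:", hex(raw[exc.start]), "at offset", exc.start)
    # Show surrounding context, decoded leniently so it cannot fail:
    print("context:", raw[max(exc.start - 30, 0):exc.start + 30].decode("latin1"))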

Related

getting an error: 'bool' object is not iterable

I want to count the number of occurrences where the condition below is true. It is not showing the value as true or false, and I am not able to count it. I don't know where I am going wrong; please have a look at my code below.
from openpyxl import load_workbook
import openpyxl as xl
from openpyxl.utils.dataframe import dataframe_to_rows
import pandas as pd
import os
import xlwings as xw
import datetime
filelist_patch=[f for f in os.listdir() if f.endswith(".xlsx") and 'SEL' in f.upper() and '~' not in f.upper()]
print(filelist_patch[0])
wb = xl.load_workbook(filelist_patch[0],read_only=True,data_only=True)
wb_device=wb["de_exp"]
cols_device = [0,9,14,18,19,20,21,22,23,24,25]
#######################average count in vuln##############################
for row in wb_device.iter_rows(max_col=25):
    cells = [cell.value for (idx, cell) in enumerate(row) if (
        idx in cols_device and cell.value is not None)]
    os_de = cells[1]
    qca_de = cells[2]
    file_data = ((os_de=="cl") & (qca_de=='Q'))
    print(sum(file_data))
Getting a type error:
TypeError                                 Traceback (most recent call last)
<ipython-input-70-735a490062da> in <module>
     30 file_data = ((os_de=="client") & (qca_de=='Q'))  # here I want to count the number of occurrences that are true
---> 31 print(sum(file_data))
     32
     33
TypeError: 'bool' object is not iterable
Your question is very hard to read. Please use proper grammar and punctuation. I understand that you are probably not a native speaker, but that is no excuse not to form proper sentences that start with a capital letter and end with a period.
Please sort your own thoughts before you ask a question, and then write them in multiple short and concise sentences.
Nonetheless, I'll try to guess what you are trying to say.
90% of your code is unrelated to the question, so I'll try to restate it. If my guess is incorrect, my answer will of course be worthless, and I'd ask you to reword your question to be more precise.
Reworded Question
Question: How to count the number of true statements in a number of conditions?
Details: Given a number of conditions (like os_de=="client" and qca_de=='Q'), how do I count the number of correct ones among them?
Attempt:
# Dummy data, to make this a short and reproducible example:
cells = ["", "cl", "D"]
os_de = cells[1]
qca_de = cells[2]
file_data = ((os_de=="cl") & (qca_de=='Q'))
print(sum(file_data))
Expected result value: 1
Actual result:
TypeError                                 Traceback (most recent call last)
<ipython-input-70-735a490062da> in <module>
     30 file_data = ((os_de=="client") & (qca_de=='Q'))
---> 31 print(sum(file_data))
     32
     33
TypeError: 'bool' object is not iterable
Answer
Both (os_de=="client") and (qca_de=='Q') are of type bool.
Using & on them still leaves a single bool. So when you try to call sum() on it, Python rightfully complains: sum() expects an iterable, such as a collection of numbers, and a single boolean is not iterable.
You are almost there, though. Instead of combining them with &, put them in a list.
# Dummy data, to make this a short and reproducible example:
cells = ["", "cl", "D"]
os_de = cells[1]
qca_de = cells[2]
file_data = [(os_de=="cl"), (qca_de=='Q')]
print(sum(file_data))
Which prints 1, as expected: https://ideone.com/96Rghq
Try to include an ideone.com link in your questions in the future; this forces you to make your example code complete, simple and reproducible.
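If the goal was to count matching rows across the whole sheet, as the original loop suggests, here is a sketch reusing the variables from the question (the sheet layout and the "cl"/'Q' values are assumed from the question's code):

count = 0
for row in wb_device.iter_rows(max_col=25):
    cells = [cell.value for (idx, cell) in enumerate(row)
             if idx in cols_device and cell.value is not None]
    if len(cells) > 2 and cells[1] == "cl" and cells[2] == 'Q':
        count += 1  # this row satisfies both conditions
print(count)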

convert ascii to decimal python

I have a pandas DataFrame, where one of the columns is filled with ASCII characters. I'm trying to convert this column from ASCII to decimal where, for example, the following string should be converted from, in hex:
313533313936393239382e323834303638
to:
1531969298.284068
I've tried
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode())
as well as
outf['data'] = outf['data'].map(lambda x: ascii.fromhex(x).decode())
The error that I get is as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 8: invalid start byte
I'm not sure where the problem manifests itself. I have a txt file and a sample of its contents are as follows:
data time
313533313936393239382e32373737343800 1531969299.283273000
313533313936393239382e32373838303400 1531969299.284253000
313533313936393239382e32373938353700 1531969299.285359000
When the data was normal integers, the lambda would work fine, where I used:
outf['data'] = outf['data'].astype(str)
outf['data'] = outf['data'].str[:-2:]
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode())
outf['data'] = outf['data'].astype(int)
however, now it says there's something wrong with the encoding.
I've looked on Stack Overflow, but perhaps I wasn't able to find something similar; nothing I tried has worked. If someone were to help me out, I would very much appreciate it.
You can use map with a lambda function applying bytearray.fromhex, then astype to convert to float.
out['data'].map(lambda x: bytearray.fromhex(x).decode()).astype(float)
Such a lambda would do the trick:
>>> f = lambda v: float((bytearray.fromhex(v)))
>>> f('313533313936393239382e323834303638')
1531969298.284068
Note that the use of numpy's astype hinted at by Scott Boston in the comment section may be better performance-wise.
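For the sample rows shown above, which end in a trailing "00" hex pair (a NUL byte once decoded), a sketch that combines the asker's earlier stripping step with the float conversion (column names as in the question). Note that a row containing a genuinely non-ASCII byte, like the 0xbb from the error, would still fail and needs inspecting separately:

outf['data'] = (outf['data'].astype(str)
                            .str[:-2]  # drop the trailing "00" NUL pair
                            .map(lambda x: bytearray.fromhex(x).decode())
                            .astype(float))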

Python 3 OutOfBoundsDatetime: Out of bounds nanosecond timestamp: (Workaround)

Encountered an error today involving importing a CSV file with dates. The file has known quality issues, and in this case one entry was "3/30/3013" due to a data entry error.
Reading other entries about the OutOfBoundsDatetime error, pandas' Timestamp upper limit maxes out at 4/11/2262. The suggested solution was to fix the formatting of the dates, but in my case the date format is correct and the data itself is wrong.
Applying numpy logic:
df['Contract_Signed_Date'] = np.where(df['Contract_Signed_Date'] > '12/16/2017',
                                      df['Alt_Date'], df['Contract_Signed_Date'])
Essentially, if the file's Contract_Signed_Date is greater than today (12/16/2017), I want to use the Alt_Date column instead. It seems to work except when it hits the year-3013 entry, where it errors out. What's a good pythonic way around the out-of-bounds error?
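One pitfall worth noting: comparing m/d/Y date strings with > is lexicographic, not chronological, so the np.where test above can misfire on dates the parser never reaches. A quick illustration:

# String comparison orders character by character, not by calendar date:
print('9/1/2016' > '12/16/2017')   # True, even though 2016 is earlier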
Perhaps hideously unpythonic but it appears to do what you want.
Input, file arthur.csv:
input_date,var1,var2
3/30/3013,2,34
02/2/2017,17,35
Code:
import pandas as pd
from io import StringIO

target_date = '2017-12-17'
for_pandas = StringIO()
print('input_date,var1,var2,alt_date', file=for_pandas)  # new header
with open('arthur.csv') as arthur:
    next(arthur)  # skip header in csv
    for line in arthur:
        line_items = line.rstrip().split(',')
        date = '{:4s}-{:0>2s}-{:0>2s}'.format(*list(reversed(line_items[0].split('/'))))
        if date > target_date:
            output = '{},{},{},{}'.format(*['NaT', line_items[1], line_items[2], date])
        else:
            output = '{},{},{},{}'.format(*[date, line_items[1], line_items[2], 'NaT'])
        print(output, file=for_pandas)
for_pandas.seek(0)
df = pd.read_csv(for_pandas, parse_dates=['input_date', 'alt_date'])
print(df)
Output:
  input_date  var1  var2    alt_date
0        NaT     2    34  3013-30-03
1 2017-02-02    17    35         NaT
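A more pandas-native sketch (assuming the same arthur.csv): pd.to_datetime with errors='coerce' turns out-of-bounds dates such as 3/30/3013 into NaT directly, after which an alternate column can be substituted:

import pandas as pd

df = pd.read_csv('arthur.csv')
# Out-of-bounds dates become NaT instead of raising OutOfBoundsDatetime.
df['input_date'] = pd.to_datetime(df['input_date'], format='%m/%d/%Y',
                                  errors='coerce')
print(df)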

Problems with graphing excel data off an internet source with dates

this is my first post on Stack Overflow and I'm pretty new to programming, especially Python. I'm in engineering and am learning Python to complement that going forward, mostly for math and graphing applications.
Basically my question is: how do I download CSV data from a source (in my case stock data from Google) and plot only certain columns against the date? For myself I want the date against the close value.
Right now the error message I'm getting is: time data '5-Jul-17' does not match '%d-%m-%Y'.
Previously I was also getting an error that tuple data does not match.
The opened CSV data in Excel has 7 columns (Date, Open, High, Low, Close, AdjClose, Volume), and the date is organized as 2017-05-30.
I'm sure there are other errors as well unfortunately
I would really be grateful for any help on this,
thank you in advance!
--edit--
Upon fiddling some more, I don't think names and dtypes are necessary; when I check the matrix dimensions without those identifiers I get (250L, 6L), which seems right. Now my main problem is converting the dates to something usable. My error now is that strptime only accepts strings, so I'm not sure what to use (see updated code below).
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

def graph_data(stock):
    # getting the data off google finance
    data = np.genfromtxt('urlgoeshere'+stock+'forthecsvdata', delimiter=',',
                         skip_header=1)
    # checking format of matrix
    print data.shape  # returns (250L, 6L)
    time_format = '%d-%m-%Y'
    # I only want the 1st column (dates) and 5th column (close), all rows
    date = data[:, 0]
    close = data[:, 4]
    dates = [datetime.strptime(date, time_format)]
    # plotting section
    plt.plot_date(dates, close, '-')
    plt.legend()
    plt.show()

graph_data('stockhere')
Assuming the dates in the csv file are in the format '5-Jul-17', the proper format string to use is %d-%b-%y.
In [6]: datetime.strptime('5-Jul-17','%d-%m-%Y')
ValueError: time data '5-Jul-17' does not match format '%d-%m-%Y'
In [7]: datetime.strptime('5-Jul-17','%d-%b-%y')
Out[7]: datetime.datetime(2017, 7, 5, 0, 0)
See the Python documentation on strptime() behavior.
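Applied to the code in the question, the date column would then be parsed element-wise; a sketch, assuming the CSV is read with strings preserved (the filename here is a placeholder, and the column layout follows the question):

import numpy as np
from datetime import datetime

# dtype=str keeps the date column as strings instead of NaN floats
data = np.genfromtxt('prices.csv', delimiter=',', skip_header=1, dtype=str)
dates = [datetime.strptime(d, '%d-%b-%y') for d in data[:, 0]]
closes = data[:, 4].astype(float)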

ValueError, though a check has already been performed for this

Getting a little stuck with NaN data. This program trawls through a folder on an external hard drive, loads a txt file as a dataframe, and should read the very last value of the last column. As some of the last rows do not complete, for whatever reason, I have chosen to take the row before (or that's what I hope to have done). Here is the code; I have commented the lines that I think are giving the trouble:
#!/usr/bin/env python3
import glob
import math
import pandas as pd
import numpy as np

def get_avitime(vbo):
    try:
        df = pd.read_csv(vbo,
                         delim_whitespace=True,
                         header=90)
        row = next(df.iterrows())
        t = df.tail(2).avitime.values[0]
        return t
    except:
        pass

def human_time(seconds):
    secs = seconds/1000
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return '%02d:%02d:%02d' % (hours, mins, secs)

def main():
    path = 'Z:\\VBox_Backup\\**\\*.vbo'
    events = {}
    customers = {}
    for vbo_path in glob.glob(path, recursive=True):
        path_list = vbo_path.split('\\')
        event = path_list[2].upper()
        customer = path_list[3].title()
        avitime = get_avitime(vbo_path)
        if not avitime:  # this is to check there is a number
            continue
        else:
            if event not in events:
                events[event] = {customer: avitime}
                print(event)
            elif customer not in events[event]:
                events[event][last_customer] = human_time(events[event][last_customer])
                print(events[event][last_customer])
                events[event][customer] = avitime
            else:
                total_time = events[event][customer]
                total_time += avitime
                events[event][customer] = total_time
            last_customer = customer
    events[event][customer] = human_time(events[event][customer])
    df_events = pd.DataFrame(events)
    df.to_csv('event_track_times.csv')

main()
I put in a line to check for a value, but I am guessing that NaN is not a null value, hence it hasn't quite worked.
(C:\Users\rob.kinsey\AppData\Local\Continuum\Anaconda3) c:\Users\rob.kinsey\Programming>python test_single.py
BARCELONA
03:52:42
02:38:31
03:21:02
00:16:35
00:59:00
00:17:45
01:31:42
03:03:03
03:16:43
01:08:03
01:59:54
00:09:03
COTA
04:38:42
02:42:34
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
04:01:13
01:19:47
03:09:31
02:37:32
03:37:34
02:14:42
04:53:01
LAGUNA_SECA
01:09:10
01:34:31
01:49:27
03:05:34
02:39:03
01:48:14
SILVERSTONE
04:39:31
01:52:21
02:53:42
02:10:44
02:11:17
02:37:11
01:19:12
04:32:21
05:06:43
SPA
Traceback (most recent call last):
File "test_single.py", line 56, in <module>
main()
File "test_single.py", line 41, in main
events[event][last_customer] = human_time(events[event][last_customer])
File "test_single.py", line 23, in human_time
The output starts out correctly, except for the sys:1 warning, but at least it carries on until the final error stalls the program completely. How can I get past this NaN issue? All variables I am working with should be of float data type or should have been ignored, and all data should be strings or floats until the time conversion, which works on integers.
Ok, even though no one answered, I am compelled to answer my own question, as I am not convinced I am the only person who has had this problem.
There are three main reasons for receiving NaN in a data frame. Most of these revolve around infinity, such as using 'inf' as a value or dividing zero by zero, which produces NaN as a result. The wiki page was the most helpful for me in solving this issue:
https://en.wikipedia.org/wiki/NaN
One other important point about NaN is that it works a little like a virus: anything that touches it in any calculation will result in NaN, so the problem can get exponentially worse. What you are actually dealing with is missing data, and until you realize that's what it is, NaN is the least useful and most frustrating thing, as it is a datatype rather than an error, yet any mathematical operation on it will end in NaN. Beware!
The reason on this occasion was that a specific line number was used to pick up the headers when reading in the csv file, and although that worked for the majority of these files, some of them had the headers I was after on a different line. As a result, the headers imported into the data frame were either part of the data itself or null values, so trying to access a column by header name resulted in NaN. As discussed earlier, this proliferated through the program, causing a few problems I had used workarounds to combat, one of which is actually acceptable: add this line:
df = df.fillna(0)
after the first definition of the df variable, in this case:
df = pd.read_csv(vbo,
                 delim_whitespace=True,
                 header=90)
The bottom line is that if you are receiving this value, the best thing really is to work out why you are getting NaN in the first place; then it is much easier to make an informed decision as to whether replacing NaN with 0 is a viable choice.
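For that "work out why" step, a quick diagnostic sketch using standard pandas calls on the df defined above:

print(df.columns.tolist())  # check the header row was actually picked up
print(df.isna().sum())      # missing-value count per column
print(df.dtypes)            # a numeric column read as object often signals a bad header row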
I sincerely hope this helps anyone who finds it.
Regards
iFunction
