Python Pandas read_csv(): Incorrectly loaded CSV - python-3.x

According to the Pandas documentation, pandas.read_csv (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) should support detection of bad lines when error_bad_lines and warn_bad_lines are set to True.
So I created a badly formatted CSV called test.csv:
aaa,bbb,ccc
ssdf,sdtf,aesrt,,,,
eart,erate
aert,aert,aert
and run read_csv:
>>> pd.read_csv('test.csv', error_bad_lines=True)
aaa bbb ccc
ssdf sdtf aesrt NaN NaN NaN NaN
eart erate NaN NaN NaN NaN NaN
As I understand the documentation, this should raise an error, but it doesn't; the badly formatted CSV is loaded instead. Pandas seems to ignore both error_bad_lines and warn_bad_lines.
Is my understanding of the documentation wrong, or is this really a bug in Pandas? Does anyone know an elegant workaround to load only the correct rows?
I'm working with Python 3.6.8, Pandas 0.25.0 and Ubuntu 18.04.

I ran some tests and found that the second line determines the number of columns expected in the rest of the file.
For example, the second line (ssdf,sdtf,aesrt,,,,) has 7 columns, so as long as none of the following rows has more than 7 columns, no error is raised.
If you modify one row to have more columns than that, it will crash. The default value for error_bad_lines is True, so you don't need to specify it explicitly.
Example without an error:
data.csv :
0,jjjjjj
1,er,ate,, # 5 columns
2,bb,b
3,sdtf,aesrt,ll,sdfd # 5 columns, so no errors appear.
4,erate,
5,aert,aert
df1 = pd.read_csv('data.csv')
df1
Result: no error
0 jjjjjj
1 er ate NaN NaN
2 bb b NaN NaN
3 sdtf aesrt ll sdfd
4 erate NaN NaN NaN
5 aert aert NaN NaN
Example with an error:
data.csv :
0,jjjjjj
1,er,ate,, # 5 columns
2,bb,b
3,sdtf,aesrt,ll,sdfd,sdf,sdf,sdf,sdf, # more than 5 columns
4,erate,
5,aert,aert
df1 = pd.read_csv('data.csv')
df1
Result: error!
ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10
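If you know up front how many columns the file is supposed to have, one possible workaround for the original question (a sketch, not a confirmed recipe) is to pin the expected field count by passing explicit column names, so that over-long rows are treated as bad lines:
import pandas as pd

# Passing names pins the expected field count; header=0 keeps the file's
# first line from being read as data on top of the supplied names.
df = pd.read_csv(
    'test.csv',
    names=['aaa', 'bbb', 'ccc'],
    header=0,
    error_bad_lines=False,  # skip over-long rows instead of raising
    warn_bad_lines=True,    # but emit a warning for each skipped line
)
# Rows with too few fields are still loaded and padded with NaN, so a
# dropna() afterwards may be needed if you want only complete rows.
print(df)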

Related

Pandas: Get substring between start and end characters

I am attempting to get the substring between specific start and end characters. I have tried several different regex patterns and am getting close to the output I need, but it is not fully correct. What can I do to fix this?
Data CSV:
ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
Code:
import pandas as pd
df = pd.read_csv('sample2.csv')
print(df)
Try:
df['TEST'].astype(str).str.extract('(1#.*(?=#))')
Output from the above code (it doesn't pick up '1#USA' at the end of the line, because the (?=#) lookahead requires a trailing #):
1#London4#Harry Potter#5Rowling#
1#England
NaN
NaN
NaN
1#China
Output needed:
1#London
1#England
1#USA
NaN
NaN
1#China
You can do this (matching only letters after 1# stops at the next digit or #, so no trailing # is required):
>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
0
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
You can try:
# capture all characters that are neither `#` nor digits
# following 1#
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
Output:
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
Name: TEST, dtype: object
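For reference, a self-contained check of the second pattern against the sample data from the question ('CITY' is just a hypothetical name for the extracted column):
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID': ['abc', 'cde', 'efg', 'ghi', 'jkm', 'xyz'],
    'TEST': ['1#London4#Harry Potter#5Rowling##',
             '6#Harry Potter1#England#5Rowling',
             '4#Harry Potter#5Rowling##1#USA',
             None,
             '4#Harry Potter5#Rowling',
             '4#Harry Potter1#China#5Rowling'],
})

df['CITY'] = df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
print(df['CITY'])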

How do I remove nan values from a DataFrame in Python? dropna() does not seem to be working for me

How do I remove nan values from a DataFrame in Python? I already tried dropna(), but that did not work for me. Also, is NaN different from nan? I am using Pandas.
When printing the DataFrame, the values do not print as NaN but instead as nan.
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
You can replace the 'nan' strings with real NaN (np.nan) using replace() and then use dropna().
import numpy as np
df = df.replace('nan', np.nan)
df = df.dropna()
Update:
Original dataframe:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Applied df.replace('nan', np.nan):
1 2.11358 0.649067060588935
2 NaN 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 NaN
Applied df.dropna():
1 2.11358 0.649067060588935
3 2.10066 0.3653980276694516
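An alternative sketch, assuming the nan entries are strings sitting in otherwise numeric columns: coerce every column to numeric so that anything unparseable becomes a real NaN, then drop those rows. The sample values below mirror the question's data.
import pandas as pd

# 'nan' here are literal strings, as in the question's printout.
df = pd.DataFrame({'a': ['2.11358', 'nan', '2.10066', '2.10545'],
                   'b': ['0.649067', '0.609413', '0.365398', 'nan']})

# Coerce to numeric: unparseable entries become real NaN, which dropna() removes.
df = df.apply(pd.to_numeric, errors='coerce').dropna()
print(df)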

How to compare Pandas columns in a DataFrame to find all entries appearing in different columns?

Full disclosure: I'm fairly new to Python and discovered Pandas today.
I created a DataFrame from two CSV files: one containing the results of a robot scanning barcode IDs, and one containing a list of instructions for the robot to execute.
import pandas as pd
#import csv file and read the column containing plate IDs scanned by Robot
scancsvdata = pd.read_csv("G:\scan.csv", header=None, sep=';', skiprows=(1),usecols=[6])
#Rename Column to Plates Scanned
scancsvdata.columns = ["IDs Scanned"]
#Remove any Duplicate Plate IDs
scancsvdataunique = scancsvdata.drop_duplicates()
#import the Worklist to be executed CSV file and read the Source Column to find required Plates
worklistdataSrceID = pd.read_csv("G:\TestWorklist.CSV", usecols=["SrceID"])
#Rename SrceID Column to Plates Required
worklistdataSrceID.rename(columns={'SrceID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataSrceIDunique = worklistdataSrceID.drop_duplicates()
#import the Worklist to be executed CSV file and read the Destination Column to find required Plates
worklistdataDestID = pd.read_csv("G:\TestWorklist.CSV", usecols=["DestID"])
#Rename DestID Column to Plates Required
worklistdataDestID.rename(columns={'DestID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataDestIDunique = worklistdataDestID.drop_duplicates()
#Combine into one Dataframe
AllData = pd.concat ([scancsvdataunique, worklistdataSrceIDunique, worklistdataDestIDunique], sort=True)
print (AllData)
The resulting Dataframe lists IDs scanned in Column 1 and IDs required in Column 2.
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 1024800.0
33 NaN 1024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
How would I go about ensuring that all the IDs in the 'IDs required' column appear in the 'IDs Scanned' column?
Ideally the results of the comparison above would be a generic message like 'All IDs found'.
If different csv files were used and the Dataframe was as follows
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 2024800.0
33 NaN 2024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
Then the result of the comparison would be the list of the missing IDs, 2024800 and 2024843.
To check True/False whether all the required items are in the column (note that item in a Series checks the index, so compare against .values):
all(item in df["IDs Scanned"].values for item in df["IDs required"].dropna().unique())
To get a sorted list of the unique missing items:
sorted(set(df["IDs required"].dropna()) - set(df["IDs Scanned"]))
Or using pandas syntax to return a DataFrame filtered to rows where IDs required are not found in IDs Scanned:
df.loc[~df["IDs required"].isin(df["IDs Scanned"])]
missing_ids = df.loc[~df['IDs required'].isin(df['IDs Scanned']), 'IDs required']
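To get the generic "All IDs found" message asked for above, a sketch combining the pieces (AllData stands in for the frame built in the question; the int() cast is only because the IDs come out as floats):
import pandas as pd

# Two small columns so the snippet runs on its own; substitute the real AllData.
AllData = pd.DataFrame({'IDs Scanned': [1024800.0, 8656.0, None, None],
                        'IDs required': [None, None, 2024800.0, 8656.0]})

required = set(AllData['IDs required'].dropna())
scanned = set(AllData['IDs Scanned'].dropna())
missing = sorted(required - scanned)

if not missing:
    print('All IDs found')
else:
    print('Missing IDs:', ', '.join(str(int(i)) for i in missing))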

Pandas append returns DF with NaN values

I'm appending data from a list to a Pandas DataFrame, and I keep getting NaN in my entries.
Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids) / 50)):
    dumps = sp.audio_features(ids[i * 50:50 * (i + 1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected result, something like:
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual:
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to specify an insertion point, letting the DataFrame index drive the inserts.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
From the pandas DataFrame.append documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html):
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
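Applied to the question, a minimal sketch of that list-then-construct pattern (the dicts below stand in for what sp.audio_features() appears to return; in the real code you would extend the list with each 50-track batch):
import pandas as pd

# Stand-in dicts; values loosely follow the expected row shown above.
rows = [
    {'danceability': 0.833, 'energy': 0.539, 'duration_ms': 149520, 'time_signature': 4},
    {'danceability': 0.612, 'energy': 0.701, 'duration_ms': 201000, 'time_signature': 4},
]

# Building the frame once from dicts keeps the keys as column names,
# so values land under the right columns instead of becoming NaN.
features_df = pd.DataFrame(rows)
print(features_df)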

Pandas array from url - only NaN values

I'm encountering a little problem when trying to download data from a URL and put it into an array using pandas:
When I simply do the following:
EUR_USD_TICK = pd.read_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
I get this error:
CParserError: Error tokenizing data. C error: Expected 2 fields in line 14, saw 3
Which I assume is due to the fact that the data retrieved from the url is compressed.
So I tried to set the compression to gzip:
EUR_USD_TICK = pd.read_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz',compression='gzip')
When doing that, I get no error message; however, the array contains only NaN values:
D Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
The length of the array matches the length of the CSV file, so I have no idea why the array is filled with NaN values.
Would anyone have an idea of why I'm getting this issue?
Thanks
PS: the CSV data that I'm trying to download:
DateTime,Bid,Ask
01/04/2015 22:00:00.389,1.19548,1.19557
01/04/2015 22:00:00.406,1.19544,1.19556
01/04/2015 22:00:00.542,1.19539,1.19556
01/04/2015 22:00:00.566,1.19544,1.19556
It seems that there are some characters in the file that Python/pandas doesn't like ('\x00').
So I used the gzip module to read the file manually and then removed those characters. After that, pandas reads the file without issues.
import pandas as pd
import requests
from io import StringIO, BytesIO
import gzip

# Download the gzipped CSV and decompress it in memory.
r = requests.get('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
gzip_file = gzip.GzipFile(fileobj=BytesIO(r.content))

# Strip the stray NUL ('\x00') characters before handing the text to pandas.
df = pd.read_csv(StringIO(gzip_file.read().decode('utf-8').replace('\x00', '')))
