Hypothesis and empty-ish DataFrames - python-hypothesis

I'm using Hypothesis to test dataframes, and when they're "empty-ish" I'm getting some unexpected behavior.
In the example below, I have a dataframe of all NaNs, and it turns up as a NoneType object rather than a dataframe (and so has no notnull() attribute):
Falsifying example: test_merge_csvs_properties(input_df_dict={'googletrend.csv':
  file week trend
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  NaN  NaN   NaN
4  NaN  NaN   NaN
5  NaN  NaN   NaN}
<snip>
Traceback (most recent call last):
File "/home/chachi/Capstone-SalesForecasting/tests/test_make_dataset_with_composite.py", line 285, in test_merge_csvs_properties
input_dataframe, df_dict = make_dataset.merge_csvs(input_df_dict)
File "/home/chachi/Capstone-SalesForecasting/tests/../src/data/make_dataset.py", line 238, in merge_csvs
if dfs_dict['googletrend.csv'].notnull().any().any():
AttributeError: 'NoneType' object has no attribute 'notnull'
Compare to ipython session, where a dataframe of all nans is still a dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> tester = pd.DataFrame({'test': [np.NaN]})
>>> tester
test
0 NaN
>>> tester.notnull().any().any()
False
I'm explicitly testing for notnull() to allow for all sorts of pathological examples. Any suggestions?

It looks like you've somehow ended up with None instead of a dataframe as that value in input_df_dict. Can you post the full test you're using, or at least the function definition and strategy? The traceback alone doesn't have enough information to tell what's happening. Quick things to check:
1. Can the strategy generate None instead of a dataframe here? If so, there's no mystery: Hypothesis is correctly reporting that it can trigger an AttributeError (see the sketch below).
2. If not, can you write a simpler test with only the logic and strategy for this dataframe?
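For reference, here is a minimal sketch of how a strategy could end up handing the test None (all names below are hypothetical reconstructions, not taken from the original test):
import hypothesis.strategies as st
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames

# If the dataframe strategy is wrapped in one_of(none(), ...),
# Hypothesis can legitimately supply None for this key.
df_dict_strategy = st.fixed_dictionaries({
    'googletrend.csv': st.one_of(
        st.none(),
        data_frames([
            column('file', dtype=float),
            column('week', dtype=float),
            column('trend', dtype=float),
        ]),
    ),
})

@given(input_df_dict=df_dict_strategy)
def test_merge_csvs_properties(input_df_dict):
    df = input_df_dict['googletrend.csv']
    if df is not None:
        df.notnull().any().any()  # fine for a real dataframe
    # when df is None, the same call raises AttributeError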

Related

Calculation of percentile and mean

I want to find the 3rd percentile of the following data and then average the data.
Given below is the data structure.
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
96927 NaN
96928 NaN
96929 NaN
96930 NaN
96931 NaN
The data of interest lies exactly in the rows 13240:61156.
Given below is my code:
import pandas as pd
import numpy as np

load_var = pd.read_excel(r'path\file name.xlsx')
load_var
a = pd.DataFrame(load_var['column whose percentile is to be found'])
print(a)
b = np.nanpercentile(a, 3)
print(b)
Please suggest what to change in the code. Thank you.
Use Series.quantile with mean in Series.agg:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col': [7, 8, 9, 4, 2, 3, np.nan],
})

f = lambda x: x.quantile(0.03)
f.__name__ = 'q'

s = df['col'].agg(['mean', f])
print(s)
mean 5.50
q 2.15
Name: col, dtype: float64
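Applied to the data in the question, that might look like this (a sketch: the file path, column name, and the 13240:61156 row span are taken from the question and may need adjusting):
import pandas as pd
import numpy as np

load_var = pd.read_excel(r'path\file name.xlsx')
col = load_var['column whose percentile is to be found']

# Keep only the rows that actually hold data, per the question.
data = col.iloc[13240:61156]

p3 = np.nanpercentile(data, 3)  # NaN-aware 3rd percentile
avg = data.mean()               # Series.mean skips NaN by default
print(p3, avg)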

How do I remove NaN values from a dataframe in Python? dropna() does not seem to be working for me

How do I remove NaN values from a dataframe in Python? I already tried dropna(), but it did not work for me. Also, is NaN different from nan? I am using Pandas.
When printing the dataframe, the values print as nan rather than NaN:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
You can replace the string nan values with real NaN using replace() and then use dropna():
import numpy as np
df = df.replace('nan', np.nan)
df = df.dropna()
Update:
Original dataframe:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Applied df.replace('nan', np.nan):
1 2.11358 0.649067060588935
2 NaN 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 NaN
Applied df.dropna():
1 2.11358 0.649067060588935
3 2.10066 0.3653980276694516
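For completeness, a self-contained version of the fix (a sketch; the column names a and b are invented for illustration):
import numpy as np
import pandas as pd

# The values print as the string 'nan' because they are strings,
# not float NaN, which is why dropna() ignored them at first.
df = pd.DataFrame({
    'a': ['2.11358', 'nan', '2.10066', '2.10545'],
    'b': ['0.649067060588935', '0.6094130485307419',
          '0.3653980276694516', 'nan'],
})

df = df.replace('nan', np.nan)  # string 'nan' -> real NaN
df = df.dropna()                # now dropna can drop those rows
print(df)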

Python Pandas read_csv(): Incorrectly loaded csv

According to the Pandas documentation, pandas.read_csv (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) should support detection of bad lines via error_bad_lines and warn_bad_lines set to True.
So I have created a csv with bad format called test.csv:
aaa,bbb,ccc
ssdf,sdtf,aesrt,,,,
eart,erate
aert,aert,aert
and run read_csv:
>>> pd.read_csv('test.csv', error_bad_lines = True )
aaa bbb ccc
ssdf sdtf aesrt NaN NaN NaN NaN
eart erate NaN NaN NaN NaN NaN
As I understand the documentation, this should raise an error, but it doesn't; instead, the bad csv is loaded. Pandas seems to ignore error_bad_lines / warn_bad_lines entirely.
Is my understanding of the documentation wrong, or is it really a bug in Pandas? Does anyone know an elegant workaround to load only the correct rows?
I'm working with Python 3.6.8, Pandas 0.25.0 and Ubuntu 18.04.
I ran some tests and found that the second line determines the number of columns expected in the rest of the file.
For example, the second line (ssdf,sdtf,aesrt,,,,) has 7 columns, so as long as all following rows have at most 7 columns, there are no errors.
If you modify a row to have more than 7 columns, it will fail. The default value of error_bad_lines is True, so you don't need to specify it explicitly.
Example without an error :
data.csv :
0,jjjjjj
1,er,ate,, # 5 columns
2,bb,b
3,sdtf,aesrt,ll,sdfd # 5 columns, so no errors appear.
4,erate,
5,aert,aert
df1 = pd.read_csv('data.csv')
df1
Result: no error
0 jjjjjj
1 er ate NaN NaN
2 bb b NaN NaN
3 sdtf aesrt ll sdfd
4 erate NaN NaN NaN
5 aert aert NaN NaN
Example with an error :
data.csv :
0,jjjjjj
1,er,ate,, # 5 columns
2,bb,b
3,sdtf,aesrt,ll,sdfd,sdf,sdf,sdf,sdf, # more than 5 columns
4,erate,
5,aert,aert
df1 = pd.read_csv('data.csv')
df1
Result: error!
ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10
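As for loading only the correct rows: skipping bad lines instead of erroring on them may be the closest thing to an elegant workaround (a sketch against pandas 0.25, the version in the question):
import pandas as pd

# pandas 0.25: drop malformed rows and print a warning for each one.
df = pd.read_csv('test.csv', error_bad_lines=False, warn_bad_lines=True)

# pandas >= 1.3 spells the same thing as:
# df = pd.read_csv('test.csv', on_bad_lines='skip')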

Pandas array from url - only NaN values

I'm encountering a small problem when trying to download data from a URL and put it into a dataframe using pandas.
When I simply do the following:
EUR_USD_TICK = pd.read_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
I get this error:
CParserError: Error tokenizing data. C error: Expected 2 fields in line 14, saw 3
I assume this is because the data retrieved from the URL is compressed, so I tried setting the compression to gzip:
EUR_USD_TICK = pd.read_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz',compression='gzip')
When doing that, I get no error message; however, the dataframe contains only NaN values:
D Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
The length of the dataframe matches the length of the csv file, so I have no idea why it is filled with NaN values.
Would anyone have an idea why I'm getting this issue?
Thanks
PS: the csv data that I'm trying to download:
DateTime,Bid,Ask
01/04/2015 22:00:00.389,1.19548,1.19557
01/04/2015 22:00:00.406,1.19544,1.19556
01/04/2015 22:00:00.542,1.19539,1.19556
01/04/2015 22:00:00.566,1.19544,1.19556
It seems that the file contains some characters that Python/pandas doesn't like: null bytes ('\x00').
So I used the gzip module to read the file manually and then removed those characters. After that, pandas reads the file without issues.
import pandas as pd
import requests
from io import StringIO, BytesIO
import gzip

# Download the raw gzip bytes and decompress them manually.
r = requests.get('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
gzip_file = gzip.GzipFile(fileobj=BytesIO(r.content))

# Strip the stray null bytes before handing the text to pandas.
pd.read_csv(StringIO(gzip_file.read().decode('utf-8').replace('\x00', '')))
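The same logic wrapped in a reusable helper (read_gzip_csv is a made-up name; it reuses the imports above):
def read_gzip_csv(url):
    # Download a gzipped CSV, strip stray null bytes, and parse it.
    r = requests.get(url)
    raw = gzip.GzipFile(fileobj=BytesIO(r.content)).read()
    return pd.read_csv(StringIO(raw.decode('utf-8').replace('\x00', '')))

EUR_USD_TICK = read_gzip_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
print(EUR_USD_TICK.head())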

Generate daterange and insert in a new column of a dataframe

Problem statement: create a dataframe with multiple columns and populate one column with a date-range series at 5-minute intervals.
Tried solution:
I initially created a dataframe with just one row and 5 columns (all NaN).
Command used to generate daterange:
rf = pd.date_range('2000-1-1', periods=5, freq='5min')
O/P of rf :
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:05:00',
'2000-01-01 00:10:00', '2000-01-01 00:15:00',
'2000-01-01 00:20:00'],
dtype='datetime64[ns]', freq='5T')
When I try to assign rf to one of the columns of df (df['column1'] = rf), it throws the exception shown below (copying the last lines):
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2879, in _sanitize_index
raise ValueError('Length of values does not match length of index')
Though I understood the issue, I didn't know the solution. I'm looking for an easy way to achieve this.
I think I'm slowly getting a feel for the power and usage of dataframes.
Initially create a dataframe:
df = pd.DataFrame(index=range(100), columns=['A', 'B', 'C'])
Then create a date_range:
date = pd.date_range('2000-1-1', periods=100, freq='5T')
Using the assign function, add the date_range as a new column of the already-created dataframe (df):
df = df.assign(D=date)
Final O/P of df:
df[:5]
A B C D
0 NaN NaN NaN 2000-01-01 00:00:00
1 NaN NaN NaN 2000-01-01 00:05:00
2 NaN NaN NaN 2000-01-01 00:10:00
3 NaN NaN NaN 2000-01-01 00:15:00
4 NaN NaN NaN 2000-01-01 00:20:00
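An alternative that sidesteps the length mismatch entirely is to build the dataframe around the date range from the start (a sketch of the same idea):
import pandas as pd

# Passing the dates to the constructor sizes the frame to match them;
# columns A, B and C come out as all-NaN automatically.
date = pd.date_range('2000-1-1', periods=100, freq='5T')
df = pd.DataFrame({'D': date}, columns=['A', 'B', 'C', 'D'])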
Your dataframe has only one row, but you're trying to insert data for five rows.
