'NA' handling in python pandas - python-3.x

I have a dataframe with Name and Age columns. The Name column contains missing values as well as the string 'NA'. When I read the file back with pd.read_excel, both the missing values and 'NA' become NaN. How can I avoid this?
This is my code:
import pandas as pd

data = {'Name': ['Tom', '', 'NA', '', 'Ricky', 'NA', ''], 'Age': [28, 34, 29, 42, 35, 33, 40]}
df = pd.DataFrame(data)
df.to_excel("test1.xlsx", sheet_name="test")

# reading the file back turns both '' and 'NA' into NaN
data = pd.read_excel("./test1.xlsx")

To avoid this, just set keep_default_na to False:
df = pd.read_excel('test1.xlsx', keep_default_na=False)
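With keep_default_na=False, read_excel no longer converts 'NA' (or the other default markers) to NaN. A minimal sketch of the round trip, reusing the file written above; the na_values example at the end is optional and only needed if you still want specific markers treated as missing:
import pandas as pd

df = pd.read_excel("test1.xlsx", keep_default_na=False)
# 'NA' now stays as the literal string 'NA' instead of becoming NaN
print(df['Name'].tolist())

# If some markers should still count as missing, list them explicitly, e.g.:
# df = pd.read_excel("test1.xlsx", keep_default_na=False, na_values=['n/a'])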

Related

Pandas - dataframe with column with dictionary, save value instead

I have the stories_data dictionary below, from which I can create a df. But since owner is itself a dictionary, I would like to extract its value so the owner column holds 178413540.
import numpy as np
import pandas as pd
stories_data = {'caption': 'Tel_gusto', 'like_count': 0, 'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}}
x = pd.DataFrame(stories_data.items())
x.set_index(0, inplace=True)
stories_metric_df = x.transpose()
del stories_metric_df['headers']
I've tried this, but it gets the key, not the value:
stories_metric_df['owner'].explode().apply(pd.Series)
You can use .str, even for objects/dicts:
stories_metric_df['owner'] = stories_metric_df['owner'].str['id']
Output:
>>> stories_metric_df
0    caption  like_count      owner
1  Tel_gusto           0  178413540
Another solution would be to skip the explode, and just extract id:
stories_metric_df['owner'].apply(pd.Series)['id']
although I suspect my first solution would be faster.
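Putting it together, a minimal, self-contained version of the .str['id'] approach (the stories_data dictionary is closed and the headers column dropped as in the question):
import pandas as pd

stories_data = {'caption': 'Tel_gusto', 'like_count': 0,
                'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}}

x = pd.DataFrame(stories_data.items())
x.set_index(0, inplace=True)
stories_metric_df = x.transpose()
del stories_metric_df['headers']

# .str indexes into each dict element, pulling out the 'id' value
stories_metric_df['owner'] = stories_metric_df['owner'].str['id']
print(stories_metric_df)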

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs and would like to add a common id column across all of them.
I have tried the code below:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, ignore_index=True)
But in the resulting dataframe's id column I only get the last iterated id, and for APIs that return many rows the data is invalid.
For performance reasons it is better to first store the data in a dictionary and then create the dataframe from it:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate dataframe retrieved from pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])
df = pd.DataFrame(d)
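If you want to keep pd.json_normalize for the real API responses, a common pattern with the same effect is to collect each chunk in a list and concatenate once at the end, attaching the common id via assign. A sketch only, reusing the placeholder url, headers, parent and child names from the question:
import json
import requests
import pandas as pd

frames = []
for i in range(1, 200):
    url = '{id}/values'.format(id=i)          # placeholder URL from the question
    res = requests.get(url, headers=headers)  # headers assumed to be defined elsewhere
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            # tag this chunk with the common id column before collecting it
            frames.append(pd.json_normalize(data[parent][child]).assign(id=i))

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()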

Groupby and transpose or unstack in Pandas

I have the following Python pandas dataframe:
There are more EventNames than shown for this date.
Each will have Race_Number = 'Race 1', 'Race 2', etc.
After a while, the date increments.
I'm trying to create a dataframe that looks like this:
Each race has a different number of runners.
Is there a way to do this in pandas?
Thanks
I assumed the output would be another DataFrame.
import pandas as pd
import numpy as np
from nltk import flatten
import copy
df = pd.DataFrame({'EventName': ['sydney', 'sydney', 'sydney', 'sydney', 'sydney', 'sydney'],
                   'Date': ['2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01'],
                   'Race_Number': ['Race1', 'Race1', 'Race1', 'Race2', 'Race2', 'Race3'],
                   'Number': [4, 7, 2, 9, 5, 10]
                   })
print(df)

dic = {}
for rows in df.itertuples():
    if rows.Race_Number in dic:
        dic[rows.Race_Number] = flatten([dic[rows.Race_Number], rows.Number])
    else:
        dic[rows.Race_Number] = rows.Number

copy_dic = copy.deepcopy(dic)
seq = np.arange(0, len(dic.keys()))
for key, n_key in zip(copy_dic, seq):
    dic[n_key] = dic.pop(key)

df = pd.DataFrame([dic])
print(df)
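Since the question title mentions groupby and unstack, here is a sketch of a vectorized alternative, assuming the desired layout is one column per race with that race's runner numbers going down the rows (NaN where a race has fewer runners):
import pandas as pd

df = pd.DataFrame({'EventName': ['sydney'] * 6,
                   'Date': ['2019-01.01'] * 6,
                   'Race_Number': ['Race1', 'Race1', 'Race1', 'Race2', 'Race2', 'Race3'],
                   'Number': [4, 7, 2, 9, 5, 10]})

# number the runners within each race, then pivot the races into columns
out = (df.assign(runner=df.groupby('Race_Number').cumcount())
         .pivot(index='runner', columns='Race_Number', values='Number'))
print(out)
# Race_Number  Race1  Race2  Race3
# runner
# 0              4.0    9.0   10.0
# 1              7.0    5.0    NaN
# 2              2.0    NaN    NaN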

Pandas: check whether a searched prefix exists or has no data before processing

I have the code snippet below, which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
sj00 sj12 cr00 cr08 eu00 eu50
sj000001 cr000011 crn00001 euk000011 eu5000011
sj000002 cr000012 crn00002 eu0000012 eu5000013
sj000003 cr000013 crn00003 eu0000013 eu5000014
sj000004 cr000014 crn00004 eu0000014 eu5000015
What's expected:
1) The code works fine, but as you can see in the current output the second column (sj12) has no values yet still appears. How could I add a check so that a column with no values is removed from the display?
2) Can we place a check for whether the prefixes exist in the dataframe before processing, to avoid the error?
Appreciate any help.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve your first part. The second part can be handled with reindex:
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note the axis=1, not axis=0; otherwise this is identical to what I propose for question 1.
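The point of reindex here is that missing prefixes simply become all-NaN columns instead of raising a KeyError, and dropna(axis=1, how='all') then removes them. A tiny illustration on a toy frame (column values are made up):
import numpy as np
import pandas as pd

# toy pivoted frame: only sj00 and cr00 actually occur in the data
df = pd.DataFrame({'sj00': ['sj000001', 'sj000002'], 'cr00': ['cr000011', 'cr000012']})
prefixes = ['sj00', 'sj12', 'cr00', 'cr08']

# df[prefixes] would raise a KeyError for sj12/cr08; reindex creates them as all-NaN,
# and dropna(axis=1, how='all') drops them again
out = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
print(out.columns.tolist())   # ['sj00', 'cr00']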
Many thanks to Quang Hoang for the hints on the post. Just as a workaround, I got it working as follows until I get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12`, only extract values where `sj12` is immediately followed by a word character, like sj12[a-z]
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# again drop rows where all values are NaN, and replace NaN with '' in the remaining columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# remove the 'grp' index name left over from the pivot
df = df.rename_axis(None)
print(df)

Inputting null values through np.select's "default" parameter

I'm trying to write values to a column given certain conditions, with a null value as the default, using the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col': list('ABCDE')})
cond1 = df['col'].eq('A')
cond2 = df['col'].isin(['B', 'E'])
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=np.NaN)
But it gives 'nan' as a string value in the column.
df['new_col'].unique()
#array(['foo', 'bar', 'nan'], dtype=object)
Is there a way to directly change it to null from this code?
Found the correct solution, which uses None as the default value:
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=None)
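The reason this works: np.select derives its result dtype from the choices plus the default, so with string choices a float NaN default ends up rendered as the literal text 'nan', while None pushes the result to object dtype and the gaps survive as real missing values. A small check, assuming default NumPy/pandas behaviour:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': list('ABCDE')})
cond1 = df['col'].eq('A')
cond2 = df['col'].isin(['B', 'E'])

# np.nan gets stringified alongside 'foo'/'bar' -> the text 'nan'
print(np.select([cond1, cond2], ['foo', 'bar'], default=np.nan))

# None keeps the result as object dtype, so the gaps are real missing values
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=None)
print(df['new_col'].isna().sum())   # 2 (rows 'C' and 'D')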
Just tested it myself and it behaves properly. Check the output of np.select(conditions, choices, default=np.nan) manually; maybe there are "NaN" strings somewhere in choices.
Also try .value_counts(dropna=False) to see how the missing values are being counted.
What I tested it with:
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris['sepal_length'] = np.select(iris.values[:,:4].T>5, iris.values[:,:4].T, default=np.nan)
print(iris['sepal_length'].value_counts())
print(iris.sepal_length.value_counts(dropna=False))
