Inputting null values through np.select's "default" parameter - python-3.x

I'm trying to write values to a column based on certain conditions, with a null value as the default, using the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col': list('ABCDE')})
cond1 = df['col'].eq('A')
cond2 = df['col'].isin(['B', 'E'])
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=np.nan)
But it gives 'nan' as string value in the column.
df['new_col'].unique()
#array(['foo', 'bar', 'nan'], dtype=object)
Is there a way to directly change it to null from this code?

Found the correct solution, which uses None as the default value:
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=None)
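A quick check (same toy frame as above) that `default=None` produces real missing values rather than the string 'nan':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': list('ABCDE')})
cond1 = df['col'].eq('A')
cond2 = df['col'].isin(['B', 'E'])

# With default=None, unmatched rows (C and D) become real missing values
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=None)
print(df['new_col'].isna().sum())   # counts the two unmatched rows as NA
```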

Just tested it myself and it behaves properly. Check the output of np.select(conditions, choices, default=np.nan) manually; maybe there are "nan" strings somewhere in choices.
Also try .value_counts(dropna=False) to see whether the missing values are real NaN (counted separately) or the string "nan".
What I tested it with:
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris['sepal_length'] = np.select(iris.values[:,:4].T>5, iris.values[:,:4].T, default=np.nan)
print(iris['sepal_length'].value_counts())
print(iris.sepal_length.value_counts(dropna=False))

Pandas - dataframe with column with dictionary, save value instead

I have the stories_data dictionary below. I can create a DataFrame from it, but since owner is itself a dictionary, I'd like to extract its value so the owner column contains 178413540.
import numpy as np
import pandas as pd
stories_data = {'caption': 'Tel_gusto', 'like_count': 0, 'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}}
x = pd.DataFrame(stories_data.items())
x.set_index(0, inplace=True)
stories_metric_df = x.transpose()
del stories_metric_df['headers']
I've tried this, but it gets the key, not the value:
stories_metric_df['owner'].explode().apply(pd.Series)
You can use .str, even for objects/dicts:
stories_metric_df['owner'] = stories_metric_df['owner'].str['id']
Output:
>>> stories_metric_df
0 caption like_count owner
1 Tel_gusto 0 178413540
Another solution would be to skip the explode, and just extract id:
stories_metric_df['owner'].apply(pd.Series)['id']
although I suspect my first solution would be faster.
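For reference, the .str accessor indexes into each element, so it works on a Series of dicts as well as strings; a minimal standalone sketch:

```python
import pandas as pd

# A Series of dicts, like the 'owner' column above
s = pd.Series([{'id': '178413540'}, {'id': '42'}])

# .str[key] looks up the key in each element, so dicts work too
print(s.str['id'].tolist())

# Equivalent, a bit more explicit:
print(s.map(lambda d: d['id']).tolist())
```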

Pandas print unique values as string

I've got an array of unique values from a selected column in a pandas DataFrame. What I want to achieve is to print the result as a string.
import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(a)
Output: ['A' 'C' 'B']
Desired output: A, C, B
So far I've tried below,
print(a.to_string())
Got this error: AttributeError: 'numpy.ndarray' object has no attribute 'to_string'
print(a.tostring())
Got this: b'\xf0\x04\xa6P\x9e\x01\x00\x000\xaf\x92P\x9e\x01\x00\x00\xb0\xaf\x92P\x9e\x01\x00\x00'
Can anyone give a hint?
import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(', '.join(a)) # or print(*a, sep=', ')
Prints:
A, C, B
EDIT: To store as variable:
text = ', '.join(a)
print(text)
This should work:
print(', '.join(a))
py3 solution
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(*a, sep=", ")

Why do np.std() and pivot_table(aggfunc=np.std) return different results?

I have some code and don't understand why the results differ:
np.std() defaults to ddof=0 when used on its own,
but when passed as aggfunc=np.std to pivot_table, it behaves as if ddof=1.
import numpy as np
import pandas as pd
dft = pd.DataFrame({'A': ['one', 'one'],
                    'B': ['A', 'A'],
                    'C': ['bar', 'bar'],
                    'D': [-0.866740402, 1.490732028]})
np.std(dft['D'])
# equivalent: np.std([-0.866740402, 1.490732028]) (default ddof=0)
# result: 1.178736215
dft.pivot_table(index=['A', 'B'], columns='C', aggfunc=np.std)
# equivalent: np.std([-0.866740402, 1.490732028], ddof=1)
# result: 1.666985
pivot_table uses DataFrame.groupby.agg, and when you supply an aggregation function it tries to figure out exactly how to _aggregate.
arg=np.std gets handled there, the relevant code being:
f = self._get_cython_func(arg)
if f and not args and not kwargs:
    return getattr(self, f)(), None
Hidden in the DataFrame class is this table:
pd.DataFrame()._cython_table
#OrderedDict([(<function sum>, 'sum'),
# (<function max>, 'max'),
# ...
# (<function numpy.std>, 'std'),
# (<function numpy.nancumsum>, 'cumsum')])
pd.DataFrame()._cython_table.get(np.std)
#'std'
And so np.std is only used to select which attribute to call; its default ddof=0 is completely ignored, and the pandas default of ddof=1 is used instead.
getattr(dft['D'], 'std')()
#1.6669847417133286
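The ddof difference can be checked directly with the same two numbers as above:

```python
import numpy as np
import pandas as pd

x = [-0.866740402, 1.490732028]
s = pd.Series(x)

print(np.std(x))          # population std, ddof=0 -> ~1.178736
print(np.std(x, ddof=1))  # sample std, ddof=1 -> ~1.666985
print(s.std())            # pandas defaults to ddof=1, matching the line above
```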

'NA' handling in python pandas

I have a DataFrame with Name and Age fields. The Name column contains both missing values and the string 'NA'. When I read the file with pd.read_excel, the missing values and 'NA' both become NaN. How can I avoid this?
this is my code
import pandas as pd
data = {'Name':['Tom', '', 'NA','', 'Ricky',"NA",''],'Age':[28,34,29,42,35,33,40]}
df = pd.DataFrame(data)
df.to_excel("test1.xlsx",sheet_name="test")
import pandas as pd
data=pd.read_excel("./test1.xlsx")
To avoid this, just set keep_default_na to False:
df = pd.read_excel('test1.xlsx', keep_default_na=False)
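keep_default_na works the same way in read_csv, which makes it easy to demonstrate without an Excel file, using an in-memory CSV:

```python
import io
import pandas as pd

csv = "Name,Age\nTom,28\n,34\nNA,29"

# Default behaviour: '' and 'NA' both become NaN
default = pd.read_csv(io.StringIO(csv))
print(default['Name'].tolist())

# With keep_default_na=False, the values are kept as written
literal = pd.read_csv(io.StringIO(csv), keep_default_na=False)
print(literal['Name'].tolist())
```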

pandas SettingWithCopyWarning only inside function

With a dataframe like
import pandas as pd
df = pd.DataFrame(
["2017-01-01 04:45:00", "2017-01-01 04:45:00removeMe"], columns=["col"]
)
why do I get a SettingWithCopyWarning here
def test_fun(df):
    df = df[~df["col"].str.endswith("removeMe")]
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
df = test_fun(df)
but not if I run it without the function?
df = df[~df["col"].str.endswith("removeMe")]
df.loc[:, "col"] = pd.to_datetime(df["col"])
And what should my function look like?
In the function, when you index df with your boolean array, you get a view of the outside-scope df; you're then trying to additionally index that view, which is why the warning comes in. Without the function, df is just a dataframe that's resized with your index instead (it's not a view).
I would write it as this instead either way:
df["col"] = pd.to_datetime(df["col"], errors='coerce')
return df[~pd.isna(df["col"])]
Found the trick:
def test_fun(df):
    df.loc[:] = df[~df["col"].str.endswith("removeMe")]  # <-- added the .loc[:]
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
Don't do df = ... in the function.
Instead do df.loc[:] = ... !
