I would like to vectorize an operation on my dataframe with NumPy arrays, but I got an error.
Here is the code.
First I initialize my dataframe:
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': [True, False, True, False],
                    'E': pd.Categorical(["test", "Draft", "test", "Draft"]),
                    'F': 'foo'})
df2
Output:
A B C D E F
0 1.0 2013-01-02 1.0 True test foo
1 1.0 2013-01-02 1.0 False Draft foo
2 1.0 2013-01-02 1.0 True test foo
3 1.0 2013-01-02 1.0 False Draft foo
Here I define the function to apply to the dataframe's columns:
def IsBillingValid(xE, yBilling):
    if xE not in ['Draft', 'Cancelled'] and yBilling == True:  # Order Edited
        return True
    else:
        return False
Here I call my function:
df2['BillingPostalCode_Det_StageName_Det'] = IsBillingValid(df2['E'].values, df2['D'].values)
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<command-2946139111570059> in <module>
16 return False
17
---> 18 df2['BillingPostalCode_Det_StageName_Det']=IsBillingValid(df2['E'].values,df2['D'].values)
19
<command-2041881674588848> in IsBillingValid(xStageName, yBilling)
207 def IsBillingValid(xStageName,yBilling):
208
--> 209 if(xStageName not in ['Draft','Cancelled'] and yBilling==True): #Order Edited
210 return True
211
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thanks for your help
You don't need apply, especially when you want a vectorized operation.
Use pandas.Series.isin:
df2['BillingPostalCode_Det_StageName_Det'] = ~df2["E"].isin({"Draft", "Cancelled"}) & df2["D"]
print(df2)
Output:
A B C D E F BillingPostalCode_Det_StageName_Det
0 1.0 2013-01-02 1.0 True test foo True
1 1.0 2013-01-02 1.0 False Draft foo False
2 1.0 2013-01-02 1.0 True test foo True
3 1.0 2013-01-02 1.0 False Draft foo False
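If you want to keep the original row-wise function, np.vectorize can wrap it; note this is essentially a Python-level loop, so the isin version above remains the faster choice. A minimal sketch:
import numpy as np

def IsBillingValid(xE, yBilling):
    # element-wise version of the original check
    return xE not in ['Draft', 'Cancelled'] and yBilling

df2['BillingPostalCode_Det_StageName_Det'] = np.vectorize(IsBillingValid)(df2['E'], df2['D'])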
I have a dataframe which looks like this:
date symbol numerator denominator
4522 2021-10-06 PAG.SG 1.0 18
1016 2020-11-23 IPA.V 1.0 5
412 2020-04-17 LRK.AX 1.0 30
1884 2021-06-03 BOUVETO.ST 1.0 1
2504 2021-04-28 VKGYO.IS 1.0 100
3523 2021-07-08 603355.SS 1.0 1
3195 2021-08-23 IDAI 1.0 1
3238 2021-08-19 6690.TWO 1.0 1000
3430 2021-07-19 CAXPD 1.0 10
2642 2021-04-15 035720.KS 1.0 1
dtypes:
date: object
symbol: object
numerator: float64
denominator: int64
I am trying to use DataFrame.assign to add a classifier column to this df, in the form of
df = df.assign(category = ['forward' if numerator > denominator else 'reverse' for numerator in df[['numerator', 'denominator']]])
But I'm receiving a TypeError stating:
TypeError: Invalid comparison between dtype=int64 and str
I have tried casting them explicitly, with:
df = df.assign(category = ['forward' if df['numerator'] > df['denominator'] else 'reverse' for df['numerator'] in df])
But receive another TypeError stating:
TypeError: '>' not supported between instances of 'str' and 'int'
This is confusing, because I'm not comparing strings; I'm comparing int and float.
Any help would be greatly appreciated.
You can still do that with np.where:
import numpy as np

df = df.assign(category=np.where(df['numerator'] > df['denominator'],
                                 'forward',
                                 'reverse'))
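For comparison, the original list comprehension also works once the two columns are paired with zip, so each comparison is between scalars rather than between a Series and a string:
# iterate over scalar (numerator, denominator) pairs instead of column names
df = df.assign(category=['forward' if n > d else 'reverse'
                         for n, d in zip(df['numerator'], df['denominator'])])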
I have the following code:
mean = dataframe.groupby('LABEL')['RESP'].mean()
minimum = dataframe.groupby('LABEL')['RESP'].min()
maximum = dataframe.groupby('LABEL')['RESP'].max()
std = dataframe.groupby('LABEL')['RESP'].std()
df = [mean, minimum, maximum]
And the following output
[LABEL
0.0 -1.193420
1.0 0.713425
2.0 -1.066513
3.0 -0.530640
4.0 -2.130600
6.0 0.084747
7.0 1.190506
Name: RESP, dtype: float64,
LABEL
0.0 -1.396179
1.0 -0.233459
2.0 -1.631165
3.0 -1.271057
4.0 -2.543640
6.0 -0.418091
7.0 -0.004578
Name: RESP, dtype: float64,
LABEL
0.0 0.042247
1.0 0.295534
2.0 0.128233
3.0 0.243975
4.0 0.088077
6.0 0.085615
7.0 0.693196
Name: RESP, dtype: float64
]
However I want the output to be a dictionary as
{label_value: [mean, min, max, std_dev]}
For example
{1: [1, 0, 2, 1], 2: [0, -1, 1, 1], ... }
I'm assuming your starting DataFrame is equivalent to the one I've synthesised below.
Calculate all of the aggregate values in one call to agg; the values are rounded so the output fits in this answer.
reset_index() on the aggregate result, then to_dict().
A dict comprehension to reformat the dict to your specification.
import random
import numpy as np
import pandas as pd

df = pd.DataFrame([[l, random.random()] for l in range(8) for k in range(500)],
                  columns=["LABEL", "RESP"])
d = (df.groupby("LABEL")["RESP"]
       .agg([np.mean, np.min, np.max, np.std])
       .round(4)
       .reset_index()
       .to_dict(orient="records"))
{e["LABEL"]: [e["mean"], e["amin"], e["amax"], e["std"]] for e in d}
Output:
{0: [0.5007, 0.0029, 0.997, 0.2842],
1: [0.4967, 0.0001, 0.9993, 0.2855],
2: [0.4742, 0.0003, 0.9931, 0.2799],
3: [0.5175, 0.0062, 0.9996, 0.2978],
4: [0.4909, 0.0018, 0.9952, 0.2912],
5: [0.4787, 0.0077, 0.9976, 0.291],
6: [0.4878, 0.0009, 0.9942, 0.2806],
7: [0.4989, 0.0066, 0.9982, 0.278]}
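Recent pandas versions also accept the aggregation names as strings, which avoids the NumPy-derived amin/amax labels; a sketch assuming the same df:
# string names give columns "mean", "min", "max", "std"
d = df.groupby("LABEL")["RESP"].agg(["mean", "min", "max", "std"]).round(4)
# build {label: [mean, min, max, std]} directly from the aggregated frame
result = {label: row.tolist() for label, row in d.iterrows()}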
I have this dataframe:
bprice cprice strike irate tleft
0 26436.90 91.70 26400 10 0.5
1 26423.00 98.90 26400 10 0.5
2 26436.00 90.00 26400 10 0.5
3 26416.35 103.60 26400 10 0.5
I am applying a function from mibian ( http://code.mibian.net/ ), used like
c = mibian.BS([1.4565, 1.45, 1, 30], callPrice=0.0359), to my dataframe:
df2['IV'] = df2.apply(lambda df: mb.BS([df2['bprice'],26400,10,df2['tleft']],
callPrice=df2['cprice']).impliedVolatility, axis=1)
But I am getting the error
TypeError: ("cannot convert the series to <class 'float'>", 'occurred at index 0')
I suggest you wrap the function and do something like this:
def my_func(x):
    return mb.BS([x['bprice'], 26400, 10, x['tleft']],
                 callPrice=x['cprice']).impliedVolatility

df2['IV'] = df2.apply(my_func, axis=1)
The way you implemented it, you actually called mb.BS with whole columns (pd.Series) of the dataframe, which is why pandas cannot convert them to a float. See the pandas apply documentation.
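Equivalently, the original lambda works once it uses its row argument instead of df2; the corrected one-liner:
df2['IV'] = df2.apply(lambda row: mb.BS([row['bprice'], 26400, 10, row['tleft']],
                                        callPrice=row['cprice']).impliedVolatility,
                      axis=1)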
This is what I do:
import pandas as pd
t = pd.DataFrame(data={'i1': [0,0,0,0,1,1,1,1,2,2,2,2],
                       'i2': [0,1,2,3,0,1,2,3,0,1,2,3],
                       'x': [1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.]})
t.set_index(['i1','i2'], inplace=True)
t.sort_index(inplace=True)
t.index.values.searchsorted( (1,1) )
This is the error I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'tuple' and 'int'
Please, help me to understand what I am doing wrong.
Index values are tuples: type(t.index.values[0]) correctly gives <class 'tuple'>, and I pass a tuple to searchsorted. So where does the 'tuple' to 'int' comparison come from?
>>> print(t)
x
i1 i2
0 0 1.0
1 2.0
2 3.0
3 4.0
1 0 5.0
1 6.0
2 7.0
3 8.0
2 0 9.0
1 10.0
2 11.0
3 12.0
searchsorted doesn't work with tuples. There's an open GitHub issue for this: "Multiarray searchsorted fails".
On that issue, one of the participants suggests using get_indexer.
With your code:
t.index.get_indexer([(1,1)])[0]
# outputs:
5
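One caveat: get_indexer returns -1 for keys that are not present, rather than raising, so check for that if your lookups can miss:
t.index.get_indexer([(9, 9)])[0]
# outputs:
-1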
I have found a solution:
>>> t.index.get_loc( (1,1) )
5
This solution is ~200 times faster than using t.index.get_indexer:
>>> import time
>>> time.clock()
168.56
>>> for i in range(10000): a = t.index.get_indexer([(1,1)])[0]
...
>>> time.clock()
176.76
>>> (176.76 - 168.56) / 10000
0.0008199999999999989 # 820e-6 sec per call
>>> time.clock()
176.76
>>> for i in range(1000000): a = t.index.get_loc( (1,1) )
...
>>> time.clock()
180.94
>>> (180.94-176.76)/1000000
4.1800000000000066e-06 # 4.2e-6 sec per call
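As an aside, time.clock() was removed in Python 3.8; a sketch of the same comparison using timeit (totals for all calls, numbers will vary by machine):
import timeit

setup = "from __main__ import t"
# total seconds for 10000 calls of each approach
print(timeit.timeit("t.index.get_indexer([(1, 1)])[0]", setup=setup, number=10000))
print(timeit.timeit("t.index.get_loc((1, 1))", setup=setup, number=10000))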
Here is what I am trying to do:
In [7]: from pandas import DataFrame, Series
In [8]: import pandas as pd
In [9]: import numpy as np
In [10]: df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                         [np.nan, np.nan], [0.75, -1.3]],
                        index=['a', 'b', 'c', 'd'],
                        columns=['one', 'two'])
Out[10]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [11]: df.idxmax()
Out[11]:
one b
two d
dtype: object
In [12]: df[df.idxmax()] = -9.99
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-018b077daf48> in <module>()
----> 1 df[df.idxmax()] = -9.99
/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
2103
2104 if isinstance(key, (Series, np.ndarray, list, Index)):
-> 2105 self._setitem_array(key, value)
2106 elif isinstance(key, DataFrame):
2107 self._setitem_frame(key, value)
/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in _setitem_array(self, key, value)
2131 self[k1] = value[k2]
2132 else:
-> 2133 indexer = self.ix._convert_to_indexer(key, axis=1)
2134 self._check_setitem_copy()
2135 self.ix._setitem_with_indexer((slice(None), indexer), value)
/usr/local/lib/python3.4/site-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
1141 if isinstance(obj, tuple) and is_setter:
1142 return {'key': obj}
-> 1143 raise KeyError('%s not in index' % objarr[mask])
1144
1145 return _values_from_object(indexer)
KeyError: "['b' 'd'] not in index"
Intuitively this should work, but it doesn't. Any workarounds?
You should iterate over the series and access the index and column name to set the values:
In [30]:
for items in df.idxmax().iteritems():
    print(items)
    df.loc[items[1], items[0]] = -9.9
df
('one', 'b')
('two', 'd')
Out[30]:
one two
a 1.40 NaN
b -9.90 -4.5
c NaN NaN
d 0.75 -9.9
I've printed the items to show what their contents are.
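One caveat: iteritems() was removed in pandas 2.0; on current versions the same loop uses items():
# pandas >= 2.0: Series.items() replaces Series.iteritems()
for col, idx in df.idxmax().items():
    df.loc[idx, col] = -9.9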