Python pandas DataFrame column insert call

I have a simple txt file and I am reading it as follows:
data=pd.read_csv("data1.txt",sep=',', header = None)
data.columns=['X1', 'Y']
When I print this I get:
X1 Y
0 6.1101 17.5920
1 5.5277 9.1302
2 8.5186 13.6620
3 7.0032 11.8540
4 5.8598 6.8233
Now I want to insert a column X0 in front of X1 (to its left) and give every row in it the value 1, so I added this code:
data = data.insert(0,'X0',1)
print(type(data))
print(len(data))
But I get the following output and error message:
<class 'NoneType'>
TypeError: object of type 'NoneType' has no len()
The question is: is my data.insert call correct? Why is the type of the dataframe NoneType? What am I doing wrong here?

Instead of using insert, which acts in place, you can use assign and then reorder the columns:
data = data.assign(X0=1)[['X0'] + data.columns.tolist()]
print(data)
X0 X1 Y
0 1 6.1101 17.5920
1 1 5.5277 9.1302
2 1 8.5186 13.6620
3 1 7.0032 11.8540
4 1 5.8598 6.8233

You cannot assign the result of DataFrame.insert to a new DataFrame, because it works in place and returns None. Call it without assigning:
data.insert(0,'X0',1)
print (data)
X0 X1 Y
0 1 6.1101 17.5920
1 1 5.5277 9.1302
2 1 8.5186 13.6620
3 1 7.0032 11.8540
4 1 5.8598 6.8233

Related

Pandas .describe() returns wrong column values in table

Look at the gld_weight column of figure 1. It shows completely wrong values. btc_weight + gld_weight should always add up to 1. Why does the gld_weight column not correspond to the returned row values when I use the describe function?
Figures 1, 2 and 3: (images not reproduced)
This is my source code:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
assets = ['BTC-USD', 'GLD']
mydata = pd.DataFrame()
for asset in assets:
    mydata[asset] = wb.DataReader(asset, data_source='yahoo', start='2015-1-1')['Close']
cleandata = mydata.dropna()
log_returns = np.log(cleandata/cleandata.shift(1))
annual_log_returns = log_returns.mean() * 252 * 100
annual_log_returns
annual_cov = log_returns.cov() * 252
annual_cov
pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []
for x in range(1000):
    weights = np.random.random(2)
    weights[0] = weights[0]/np.sum(weights)
    weights[1] = weights[1]/np.sum(weights)
    weights /= np.sum(weights)
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))
pfolio_returns
pfolio_volatility
npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)
new_portfolio = pd.DataFrame({
    'Returns': npfolio_returns,
    'Volatility': npfolio_volatility,
    'btc_weight': btc_weight,
    'gld_weight': gld_weight
})

I'm not 100% sure I understood your question correctly, but one issue might be that you are not assigning the output to a variable, and therefore not saving it.
Try adjusting your code like this:
new_portfolio = new_portfolio.sort_values(by="Returns")
Or set the inplace parameter to True.

Short answer:
The issue was found in the for-loop where the initial weight values are normalized. For the fix, see Update 1 further down in this answer.
Background to getting the solution:
At first glance OP's code seemed to be in order, and the values in the arrays matched what the code asks for. During testing, range(1000) turned out to be asking for trouble: with that many random results it is hard to keep an overview of the values. Because the question was phrased as a transformation issue, a mix-up of x/y axis values or some other transformation error was hard to study at that size.
To tackle this I used static values, as can be seen for annual_log_returns and annual_cov.
I then printed all intermediate outputs so the values could be checked at each step; it seemed possible that values changed during run time because results were not assigned back (as also suggested by Pavel Klammert in his answer).
After feedback in the comments I figured out what OP meant by "the values are wrong". I then focused on how the values used to fill the arrays were created.
The cause of the "wrong values" was found:
weights[0] = weights[0]/np.sum(weights) overwrites the original weights[0] with a new value, which then feeds into np.sum(weights) on the next line, weights[1] = weights[1]/np.sum(weights). As a result, the two values do not sum to 1 after these two lines.
The results of the two divisions are therefore stored in new variables 'a' and 'b' instead of being written back to weights[0] and weights[1], so the initial weights are not overwritten. Then the outcome is as planned.
Problem solved.
import numpy as np
import pandas as pd
pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []
annual_log_returns = [0.69, 0.71]
annual_cov = 0.73
ranger = 5
for x in range(ranger):
    weights = np.random.random(2)
    weights[0] = weights[0]/np.sum(weights)
    weights[1] = weights[1]/np.sum(weights)
    weights /= np.sum(weights)
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))
print (weights[0])
print (weights[1])
print (weights)
#print (pfolio_returns)
#print (pfolio_volatility)
npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)
#df = pd.DataFrame(array, index = row_names, columns=column_names, dtype = dtype)
new_portfolio = pd.DataFrame({'Returns': npfolio_returns, 'Volatility': npfolio_volatility, 'btc_weight': btc_weight, 'gld_weight': gld_weight})
print (new_portfolio, '\n')
sort = new_portfolio.sort_values(by='Returns')
sort_max_gld_weight = sort.loc[ranger-1, 'gld_weight']
print ('Sort:\n', sort, '\n')
print ('sort max_gld_weight : "%s"\n' % sort_max_gld_weight) # the row labelled ranger-1 (e.g. "999") holds the highest gld_weight only by chance; in most cases it does not
sort_max_gld_weight = sort.max(axis=0).iloc[3] # column-wise maxima; position 3 is the fourth column, 'gld_weight'
print ('sort max_gld_weight : "%s"\n' % sort_max_gld_weight) # prints the maximum of the 'gld_weight' column
desc = new_portfolio.describe()
desc_max_gld_weight =desc.loc['max', 'gld_weight']
print ('Describe:\n', desc, '\n')
print ('desc max_gld_weight : "%s"\n' % desc_max_gld_weight)
max_val_gld = new_portfolio.loc[new_portfolio['gld_weight'] == sort_max_gld_weight]
print('max val gld:\n', max_val_gld, '\n')
locations = new_portfolio.loc[new_portfolio['gld_weight'] > 0.99]
print ('location:\n', locations)
Result can be for example:
0.9779586087178525
0.02204139128214753
[0.97795861 0.02204139]
Returns Volatility btc_weight gld_weight
0 0.702820 0.627707 0.359024 0.640976
1 0.709807 0.846179 0.009670 0.990330
2 0.708724 0.801756 0.063786 0.936214
3 0.702010 0.616237 0.399496 0.600504
4 0.690441 0.835780 0.977959 0.022041
Sort:
Returns Volatility btc_weight gld_weight
4 0.690441 0.835780 0.977959 0.022041
3 0.702010 0.616237 0.399496 0.600504
0 0.702820 0.627707 0.359024 0.640976
2 0.708724 0.801756 0.063786 0.936214
1 0.709807 0.846179 0.009670 0.990330
sort max_gld_weight : "0.02204139128214753"
sort max_gld_weight : "0.9903300366638084"
Describe:
Returns Volatility btc_weight gld_weight
count 5.000000 5.000000 5.000000 5.000000
mean 0.702760 0.745532 0.361987 0.638013
std 0.007706 0.114057 0.385321 0.385321
min 0.690441 0.616237 0.009670 0.022041
25% 0.702010 0.627707 0.063786 0.600504
50% 0.702820 0.801756 0.359024 0.640976
75% 0.708724 0.835780 0.399496 0.936214
max 0.709807 0.846179 0.977959 0.990330
desc max_gld_weight : "0.9903300366638084"
max val gld:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
location:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
Update 1:
for x in range(ranger):
    weights = np.random.random(2)
    print (weights)
    a = weights[0]/np.sum(weights) # see comments below.
    print (weights[0])
    b = weights[1]/np.sum(weights) # see comments below.
    print (weights[1])
    print ('w0 + w1=', weights[0] + weights[1])
    weights /= np.sum(weights)
    btc_weight.append(a)
    gld_weight.append(b)
    print('a=', a, 'b=',b , 'a+b=', a+b)
The new output becomes for example:
[0.37710183 0.72933416]
0.3771018292953062
0.7293341569809412
w0 + w1= 1.1064359862762474
a= 0.34082570882790686 b= 0.6591742911720931 a+b= 1.0
[0.09301326 0.05296838]
0.09301326441107827
0.05296838430180717
w0 + w1= 0.14598164871288544
a= 0.637157240181712 b= 0.3628427598182879 a+b= 1.0
[0.48501305 0.56078073]
0.48501305100305336
0.5607807281299131
w0 + w1= 1.0457937791329663
a= 0.46377503928658087 b= 0.5362249607134192 a+b= 1.0
[0.41271663 0.89734662]
0.4127166254704412
0.8973466186511199
w0 + w1= 1.3100632441215612
a= 0.31503564986069105 b= 0.6849643501393089 a+b= 1.0
[0.11854074 0.57862593]
0.11854073835784273
0.5786259314340823
w0 + w1= 0.697166669791925
a= 0.1700321364950252 b= 0.8299678635049749 a+b= 1.0
Results printed outside the for-loop:
0.1700321364950252
0.8299678635049749
[0.17003214 0.82996786]
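The overwrite described in this answer can be reproduced in isolation with fixed numbers instead of random ones. This is only a sketch: the input values 1.0 and 3.0 and the names raw, sequential and one_step are chosen for illustration and do not come from the original code.

```python
import numpy as np

raw = np.array([1.0, 3.0])

# Element-by-element normalisation: the second division already sees
# the overwritten first element, so the pair no longer sums to 1.
sequential = raw.copy()
sequential[0] = sequential[0] / np.sum(sequential)   # 1 / 4
sequential[1] = sequential[1] / np.sum(sequential)   # 3 / 3.25
print(sequential, sequential.sum())

# Dividing the whole array once uses the same sum for both elements.
one_step = raw / np.sum(raw)
print(one_step, one_step.sum())
```

Dividing the array in a single step avoids the need for the helper variables a and b altogether.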

Issue in passing an array to an index in Series object(TypeError: len() of unsized object)

I have the following data:
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
Then I tried:
univals = set(a)
serObj=pd.Series()
for ele in univals:
    indexfound=np.where(a == ele)
    Xpointsfromindex=np.take(b, indexfound)
    serobj1=pd.Series(Xpointsfromindex[0],index=ele) ##error happening here
    serObj.append(serobj1)
print(serObj)
I expect output to be like
0 ['x1','x3']
1 ['x2','x4']
2 ['x5','x6']
But it gives me the error "TypeError: len() of unsized object".
What am I doing wrong?

If both lists have the same length, I believe you can create a DataFrame from them and then build the lists with groupby:
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
df = pd.DataFrame({'a':a, 'b':b})
print(df)
a b
0 0 x1
1 1 x2
2 0 x3
3 1 x4
4 2 x5
5 2 x6
serObj = df.groupby('a')['b'].apply(list)
print (serObj)
a
0 [x1, x3]
1 [x2, x4]
2 [x5, x6]
Name: b, dtype: object

Just to stick to what OP was doing, here is the full code that works -
import pandas as pd
import numpy as np
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
univals = set(a)
serObj=pd.Series()
for ele in univals:
    indexfound=np.where([i==ele for i in a])
    Xpointsfromindex=np.take(b, indexfound)
    print(Xpointsfromindex)
    serobj1=pd.Series(Xpointsfromindex[0],index=[ele for _ in range(np.shape(indexfound)[1])])
    serObj.append(serobj1)
print(serObj)
Output
[['x1' 'x3']]
[['x2' 'x4']]
[['x5' 'x6']]
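Explanation
In indexfound=np.where(a == ele), the comparison a == ele is always False because it compares a whole list with a scalar, so np.where finds no indices. Changing it to a list comprehension yields the matching indices.
The next change is using a list comprehension for the index parameter of pd.Series, so the index has one label per value.
Note that Series.append returns a new Series rather than modifying serObj, so its result has to be assigned back to be kept; in pandas 2.0 and later append was removed in favour of pd.concat.
This will set you on your way to what you want to achieve.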
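For comparison, here is one way to keep the loop-based approach while arriving at the Series of lists the question asks for. It is a sketch rather than the answerer's code: the names a_arr, b_arr, grouped and ser are introduced here, and the lists are converted to NumPy arrays first so that the element-wise comparison works.

```python
import numpy as np
import pandas as pd

a = [0, 1, 0, 1, 2, 2]
b = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']

# Converting to arrays makes `a_arr == ele` an element-wise comparison
a_arr = np.array(a)
b_arr = np.array(b)

grouped = {}
for ele in sorted(set(a)):
    idx = np.where(a_arr == ele)[0]      # positions where a equals ele
    grouped[ele] = b_arr[idx].tolist()   # matching entries of b as a list

# One Series entry per unique value of a, each holding a list
ser = pd.Series(grouped)
print(ser)
```

Collecting into a dict and building the Series once avoids appending inside the loop.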

passing parameters in groupby aggregate function

I have a dataframe, referenced as df in the code, and I'm applying aggregate functions to multiple columns of each group. I also apply the user-defined lambda functions f4, f5, f6 and f7. Some of them are very similar, like f4, f6 and f7, where only the threshold value differs. Can I pass these parameters from the dictionary d, so that I only have to write one function instead of several?
f4 = lambda x: len(x[x>10]) # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f6,
'acc_rate':f7,
'bearing':['sum', f4],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I would like to write something like this (pseudocode):
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(0.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
The csv file of the dataframe df is available at the following link for more clarity about the data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9

It is possible, but not easy; the solution below follows neilaronson.
It can also be simplified by summing the True values of a boolean mask instead of taking the len of the filtered values.
def f4(p):
    def ipf(x):
        return (x < p).sum()
        #your solution
        #return len(x[x < p])
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a parameter that selects greater or less:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be greater or less only")
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4, 'less'), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2, 'greater'),
'acc_rate':f4(.25, 'greater'),
'bearing':['sum', f4(10, 'greater')],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print (df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
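The closure pattern from this answer can be tried on a tiny frame without the linked csv file. This is a sketch under assumptions: the column names g and v, the data values and the factory name count_where are made up for illustration, and each generated function gets a distinct __name__ so the flattened column names stay distinguishable.

```python
import pandas as pd

def count_where(threshold, op):
    """Return an aggregation function counting values above or below threshold."""
    def counter(x):
        if op == 'greater':
            return (x > threshold).sum()
        if op == 'less':
            return (x < threshold).sum()
        raise ValueError("op has to be 'greater' or 'less'")
    # A distinct name per (op, threshold) pair keeps result columns apart
    counter.__name__ = 'count_{}_{}'.format(op, threshold)
    return counter

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b', 'b'],
                   'v': [1.0, 5.0, 2.0, 6.0, 7.0]})

out = df.groupby('g').agg({'v': [count_where(4, 'greater'),
                                 count_where(4, 'less'),
                                 'sum']})
# Flatten the MultiIndex columns, as in the answer
out.columns = out.columns.map('_'.join)
print(out)
```

Giving every generated function its own name matters when two of them are applied to the same column, since the name becomes part of the output column label.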

how to replace a cell in a pandas dataframe

After forming the below python pandas dataframe (for example)
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
    ...:     print( i.Index, i.Name, i.Age )
    ...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
    ...:     if i.Name == "Alex":
    ...:         df.at[i.Index, 'Age'] = 100
    ...:
Which seems to work
In [64]: df
Out[64]:
Name Age
0 Alex 100
1 Bob 12
2 Clarke 13
The problem appears when I do the same on a larger, different dataset:
First, I create a new column named NETELEMENT with a default value of "".
Then I would like to replace the default value "" with the string that the function lookup_netelement returns.
df['NETELEMENT'] = ""
for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    print( i, lookup_netelement(i.PEER_SRC_IP) )
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong?
EDIT: for completeness, lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:
        # bind the exception as e so it can be included in the log message
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x

I hope you are looking for where, which does conditional replacement, i.e.
def wow(x):
    return x ** 10
df['new'] = df['Age'].where(~(df['Name'] == 'Alex'),wow(df['Age']))
Output :
Name Age new
0 Alex 10 10000000000
1 Bob 12 12
2 Clarke 13 13
3 Alex 15 576650390625
Based on your edit, you are trying to apply the function to a column, i.e.
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: For your comment on passing two columns, use a lambda with axis=1, i.e.
def wow(x,y):
    return '{} {}'.format(x,y)
df.apply(lambda x : wow(x['Name'],x['Age']), axis=1)
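The apply suggestion can be tried without the real lookup backend. In this sketch the dictionary LOOKUP_TABLE and the IP addresses are hypothetical stand-ins for whatever LOOKUP['conn'] returns in the question; only the column names PEER_SRC_IP and NETELEMENT come from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the real lookup backend used in the question
LOOKUP_TABLE = {'10.0.0.1': 'routerX', '10.0.0.2': 'routerY'}

def lookup_netelement(ipaddr):
    """Return the device name for an IP address, or '' if unknown."""
    return LOOKUP_TABLE.get(ipaddr, '')

df = pd.DataFrame({'PEER_SRC_IP': ['10.0.0.1', '10.0.0.2', '10.0.0.3']})

# Fill the whole column in one call instead of looping with itertuples
df['NETELEMENT'] = df['PEER_SRC_IP'].apply(lookup_netelement)
print(df)
```

This writes the looked-up values directly into the column, so there is no need to create it with a default of "" first.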

TypeError: unhashable type: 'Int64Index'

The section of my code that is causing me problems is
def Half_Increase(self):
    self.keg_count=summer17.iloc[self.result_rows,2].values[0]
    self.keg_count +=1
    summer17[self.result_rows,2] = self.keg_count
    print(keg_count)
So this function is to be executed when a button widget is pressed. It's supposed to get the value from a specific cell in a dataframe, add 1 to it, and then return the new value to the dataframe. (I'm not entirely sure if this is the proper way to do this.)
I get the following error
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python3.6\lib\tkinter\__init__.py", line 1699, in __call__
    return self.func(*args)
  File "beerfest_program_v0.3.py", line 152, in Half_Increase
    summer17[self.result_rows,2] = self.keg_count
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
    self._set_item(key, value)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 2397, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 2596, in _sanitize_column
    if broadcast and key in self.columns and value.ndim == 1:
  File "C:\Python3.6\lib\site-packages\pandas\core\indexes\base.py", line 1640, in __contains__
    hash(key)
  File "C:\Python3.6\lib\site-packages\pandas\core\indexes\base.py", line 1667, in __hash__
    raise TypeError("unhashable type: %r" % type(self).__name__)
TypeError: unhashable type: 'Int64Index'
I'm guessing this has something to do with the variable types not matching, but I've looked and can't find how to remedy this.

I think you need iloc:
summer17.iloc[result_rows,2] += 1
Sample:
summer17 = pd.DataFrame({'a':[1,2,3],
                         'b':[3,4,5],
                         'c':[5,9,7]})
#if result_rows is scalar
result_rows = 1
print(summer17)
a b c
0 1 3 5
1 2 4 9
2 3 5 7
summer17.iloc[result_rows,2] += 1
print(summer17)
a b c
0 1 3 5
1 2 4 10
2 3 5 7
It is the same as:
#get value
keg_count=summer17.iloc[result_rows,2]
#increment
keg_count +=1
#set value
summer17.iloc[result_rows,2] = keg_count
print(summer17)
a b c
0 1 3 5
1 2 4 10
2 3 5 7
But if result_rows is a list or a 1d array:
result_rows = [1,2]
#get all values per positions defined in result_rows
#filter only first value by values[0]
keg_count=summer17.iloc[result_rows,2].values[0]
#increment
keg_count +=1
#set all values of result_rows by incremented value
summer17.iloc[result_rows,2] = keg_count
print(summer17)
a b c
0 1 3 5
1 2 4 10
2 3 5 10
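The traceback in the question mentions Int64Index, which suggests that result_rows holds index labels rather than a scalar. Under that assumption (it is not confirmed in the question), a label-based variant with loc looks like the sketch below; the way result_rows is built here is made up for illustration.

```python
import pandas as pd

summer17 = pd.DataFrame({'a': [1, 2, 3],
                         'b': [3, 4, 5],
                         'c': [5, 9, 7]})

# Assumption: result_rows is an Index of row labels, e.g. from a lookup
result_rows = summer17.index[summer17['a'] == 2]

# summer17[result_rows, 2] = ... treats the tuple as a column key,
# which is what triggers the unhashable-type error; an indexer such
# as loc (labels) or iloc (positions) addresses cells instead.
summer17.loc[result_rows, 'c'] += 1
print(summer17)
```

With the default integer index, labels and positions coincide, so loc with the column name and iloc with the column position select the same cell here.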
