Pandas .describe() returns wrong column values in table

Pandas .describe() returns wrong column values in table - python-3.x

Look at the gld_weight column of figure 1. It is throwing off completely wrong values. The btc_weight + gld_weight should always adds up to 1. But why is the gld_weight column not corresponding to the returned row values when I used the describe function?
Figure 1:
Figure 2:
Figure 3:
This is my source code:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
assets = ['BTC-USD', 'GLD']
mydata = pd.DataFrame()
for asset in assets:
mydata[asset] = wb.DataReader(asset, data_source='yahoo', start='2015-1-1')['Close']
cleandata = mydata.dropna()
log_returns = np.log(cleandata/cleandata.shift(1))
annual_log_returns = log_returns.mean() * 252 * 100
annual_log_returns
annual_cov = log_returns.cov() * 252
annual_cov
pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []
for x in range(1000):
weights = np.random.random(2)
weights[0] = weights[0]/np.sum(weights)
weights[1] = weights[1]/np.sum(weights)
weights /= np.sum(weights)
btc_weight.append(weights[0])
gld_weight.append(weights[1])
pfolio_returns.append(np.dot(annual_log_returns, weights))
pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))
pfolio_returns
pfolio_volatility
npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)
new_portfolio = pd.DataFrame({
'Returns': npfolio_returns,
'Volatility': npfolio_volatility,
'btc_weight': btc_weight,
'gld_weight': gld_weight
})

I'am not 100% sure i got your question correctly, but an issue might be, that you are not reassigning the output to new variable, therefore not saving it.
Try to adjust your code in this matter:
new_portfolio = new_portfolio.sort_values(by="Returns")
Or turn inplace parameter to True - link

Short answer :
The issue at hand was found in the for-loop were the initial weight value normalization was done. How its fixed: see update 1 below in the answer.
Background to getting the solution:
At first glance the code of OP seemed to be in order and values in the arrays were fitted as expected by the requests OP made via the written codes. From testing it appeared that with range(1000) was asking for trouble because value-outcome oversight was lost due to the vast amount of "randomness" results. Especially as the question was written as a transformation issue. So x/y axis values mixing or some other kind of transformation error was hard to study.
To tackle this I used static values as can be seen for annual_log_returns and annual_cov.
Then I've locked all outputs for print so the values become locked in place and can't be changed further down the processing. .. it was possible that the prints of code changed during run-time because the arrays were not locked (also suggested by Pavel Klammert in his answer).
After commented feedback I've figured out what OP meant with "the values are wrong. I then focused on the method how the used values, to fill the arrays, were created.
The issue of "throwing wrong values was found :
The use of weights[0] = weights[0]/np.sum(weights) replaces the original list weights[0] value for new weights[0] which then serves as new input for weights[1] = weights[1]/np.sum(weights) and therefore sum = 1 is never reached.
The variable names weights[0] and weights[1] were then changed into 'a' and 'b' at two places directly after the creation of weights [0] and [1] values to prevent overwriting the initial weights values. Then the outcome is as "planned".
Problem solved.
import numpy as np
import pandas as pd
pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []
annual_log_returns = [0.69, 0.71]
annual_cov = 0.73
ranger = 5
for x in range(ranger):
weights = np.random.random(2)
weights[0] = weights[0]/np.sum(weights)
weights[1] = weights[1]/np.sum(weights)
weights /= np.sum(weights)
btc_weight.append(weights[0])
gld_weight.append(weights[1])
pfolio_returns.append(np.dot(annual_log_returns, weights))
pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))
print (weights[0])
print (weights[1])
print (weights)
#print (pfolio_returns)
#print (pfolio_volatility)
npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)
#df = pd.DataFrame(array, index = row_names, columns=colomn_names, dtype = dtype)
new_portfolio = pd.DataFrame({'Returns': npfolio_returns, 'Volatility': npfolio_volatility, 'btc_weight': btc_weight, 'gld_weight': gld_weight})
print (new_portfolio, '\n')
sort = new_portfolio.sort_values(by='Returns')
sort_max_gld_weight = sort.loc[ranger-1, 'gld_weight']
print ('Sort:\n', sort, '\n')
print ('sort max_gld_weight : "%s"\n' % sort_max_gld_weight) # if "999" contains the highest gld_weight... but most cases its not!
sort_max_gld_weight = sort.max(axis=0)[3] # this returns colomn 4 'gld_weight' value.
print ('sort max_gld_weight : "%s"\n' % sort_max_gld_weight) # this returns colomn 4 'gld_weight' value.
desc = new_portfolio.describe()
desc_max_gld_weight =desc.loc['max', 'gld_weight']
print ('Describe:\n', desc, '\n')
print ('desc max_gld_weight : "%s"\n' % desc_max_gld_weight)
max_val_gld = new_portfolio.loc[new_portfolio['gld_weight'] == sort_max_gld_weight]
print('max val gld:\n', max_val_gld, '\n')
locations = new_portfolio.loc[new_portfolio['gld_weight'] > 0.99]
print ('location:\n', locations)
Result can be for example:
0.9779586087178525
0.02204139128214753
[0.97795861 0.02204139]
Returns Volatility btc_weight gld_weight
0 0.702820 0.627707 0.359024 0.640976
1 0.709807 0.846179 0.009670 0.990330
2 0.708724 0.801756 0.063786 0.936214
3 0.702010 0.616237 0.399496 0.600504
4 0.690441 0.835780 0.977959 0.022041
Sort:
Returns Volatility btc_weight gld_weight
4 0.690441 0.835780 0.977959 0.022041
3 0.702010 0.616237 0.399496 0.600504
0 0.702820 0.627707 0.359024 0.640976
2 0.708724 0.801756 0.063786 0.936214
1 0.709807 0.846179 0.009670 0.990330
sort max_gld_weight : "0.02204139128214753"
sort max_gld_weight : "0.9903300366638084"
Describe:
Returns Volatility btc_weight gld_weight
count 5.000000 5.000000 5.000000 5.000000
mean 0.702760 0.745532 0.361987 0.638013
std 0.007706 0.114057 0.385321 0.385321
min 0.690441 0.616237 0.009670 0.022041
25% 0.702010 0.627707 0.063786 0.600504
50% 0.702820 0.801756 0.359024 0.640976
75% 0.708724 0.835780 0.399496 0.936214
max 0.709807 0.846179 0.977959 0.990330
desc max_gld_weight : "0.9903300366638084"
max val gld:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
loacation:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
Update 1 :
for x in range(ranger):
weights = np.random.random(2)
print (weights)
a = weights[0]/np.sum(weights) # see comments below.
print (weights[0])
b = weights[1]/np.sum(weights) # see comments below.
print (weights[1])
print ('w0 + w1=', weights[0] + weights[1])
weights /= np.sum(weights)
btc_weight.append(a)
gld_weight.append(b)
print('a=', a, 'b=',b , 'a+b=', a+b)
The new output becomes for example:
[0.37710183 0.72933416]
0.3771018292953062
0.7293341569809412
w0 + w1= 1.1064359862762474
a= 0.34082570882790686 b= 0.6591742911720931 a+b= 1.0
[0.09301326 0.05296838]
0.09301326441107827
0.05296838430180717
w0 + w1= 0.14598164871288544
a= 0.637157240181712 b= 0.3628427598182879 a+b= 1.0
[0.48501305 0.56078073]
0.48501305100305336
0.5607807281299131
w0 + w1= 1.0457937791329663
a= 0.46377503928658087 b= 0.5362249607134192 a+b= 1.0
[0.41271663 0.89734662]
0.4127166254704412
0.8973466186511199
w0 + w1= 1.3100632441215612
a= 0.31503564986069105 b= 0.6849643501393089 a+b= 1.0
[0.11854074 0.57862593]
0.11854073835784273
0.5786259314340823
w0 + w1= 0.697166669791925
a= 0.1700321364950252 b= 0.8299678635049749 a+b= 1.0
Results printed outside the for-loop:
0.1700321364950252
0.8299678635049749
[0.17003214 0.82996786]

Related

How can I interpolate values from two lists (in Python)?

I am relatively new to coding in Python. I have mainly used MatLab in the past and am used to having vectors that can be referenced explicitly rather than appended lists. I have a script where I generate a list of x- and y- (z-, v-, etc) values. Later, I want to interpolate and then print a table of the values at specified points. Here is a MWE. The problem is at line 48:
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
I'm not sure I have the correct syntax for the last two lines either:
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Here is the full script for the MWE:
#This script was written to test how to interpolate after data was created in a loop and stored as a list. Can a list be accessed explicitly like a vector in matlab?
#
from scipy.interpolate import interp1d
from math import * #for ceil
from astropy.table import Table #for Table
import numpy as np
# define the initial conditions
x = 0 # initial x position
y = 0 # initial y position
Rmax = 10 # maxium range
""" initializing variables for plots"""
x_list = [x]
y_list = [y]
""" define functions"""
# not necessary for this MWE
"""create sample data for MWE"""
# x and y data are calculated using functions and appended to their respective lists
h = 1
t = 0
tf = 10
N=ceil(tf/h)
# Example of interpolation without a loop: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html#d-interpolation-interp1d
#x = np.linspace(0, 10, num=11, endpoint=True)
#y = np.cos(-x**2/9.0)
#f = interp1d(x, y)
for i in range(N):
x = h*i
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
x_list.append(x)
y_list.append(y)
## Interpolation after x- and y-lists are already created
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Your help and patience will be greatly appreciated!
Best regards,
Alex

Your code has some glaring issues that made it really difficult to understand. Let's first take a look at some things I needed to fix:
for i in range(N):
x = h*1
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
x_list.append(x)
y_list.append(y)
You are appending a single value without modifying it. What I presume you wanted is down below.
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
This is where things get strange. First: use pandas tables, this is the more popular choice. Second: I have no idea what you are trying to loop over. What I presume you wanted was to vary the number of points for the interpolation, which I have done so below. Third: you are trying to interpolate a point, when you probably want to interpolate over a range of points (...interpolation). Lastly, you are using the interp1d function incorrectly. Please take a look at the code below or run it here; let me know what you exactly wanted (specifically: what should xq / xq(nn) be?), because the MRE you provided is quite confusing.
from scipy.interpolate import interp1d
from math import *
import numpy as np
Rmax = 10
h = 1
t = 0
tf = 10
N = ceil(tf/h)
x = np.arange(0,N+1)
y = np.cos(-x**2/9.0)
interval = 0.5
NN = ceil(Rmax/interval) + 1
ip_list = np.arange(1,interval*NN,interval)
xtable = []
ytable = []
for i,nn in enumerate(ip_list):
f = interp1d(x,y)
x_i = np.arange(0,nn+interval,interval)
xtable += [x_i]
ytable += [f(x_i)]
[print(i) for i in xtable]
[print(i) for i in ytable]

How to append a loop value to a list?

I don't fathom why the output isn't a list...am I appending wrongly?
from numpy import *
b=0.1;g=0.5;l=632.8;p=2;I1=[I];I=0
for a in arange(-0.2,0.2,0.001):
I+=b**2*(sin(pi/l*b*sin(a)))**2/(pi/l*b*sin(a))**2*(sin(p*pi /l*g*sin(a)))**2/(sin(pi/l*g*sin(a)))**2
I1.append(I)
print (I)
output: 15.999998678557855

Several errors in your code, missing imports etc. See comments inside code for fixes:
from numpy import arange
from math import sin,pi
b = 0.1
g = 0.5
l = 632.8
p = 2
I = 0 # you need to specify I
I1 = [I] # before you can add it
for a in arange(-0.2,0.2,0.001):
I += b**2 * (sin(pi/l*b*sin(a)))**2 / (pi/l*b*sin(a))**2 * (sin(p*pi /l*g*sin(a)))**2 / (sin(pi/l*g*sin(a)))**2
I1.append(I) # by indenting this you move it _inside_ the loop
print (I)
print (I1)
Output:
15.999998678557855
[0, 0.03999999014218294, 0.07999998038139602, 0.1199999707171788, ....] # shortened

How can I improve this solution to make it faster using numpy?

The problem statement:
An unnamed tourist got lost in New York. All he has is a map of M
metro stations, which shows the coordinates of the stations and his
own coordinates, which he saw on the nearby pointer. The tourist is
not sure that each of the stations is open, therefore, just in case,
he is looking for the nearest N stations. The tourist moves
through New York City like every New Yorker (Distance of city
quarters). Help the tourist to find these stations.
Sample input
5 2
А 1 2
B 4.5 1.2
C 100500 100500
D 100501 100501
E 100502 100502
1 1
Sample output
A B
My code:
import scipy.spatial.distance as d
import math
#finds N nearest metro stations in relation to the tourist
def find_shortest_N(distance_list, name_list, number_of_stations):
result = []
for num in range(0, number_of_stations):
min_val_index = distance_list.index(min(distance_list))
result.append(name_list[min_val_index])
distance_list.pop(min_val_index)
name_list.pop(min_val_index)
return result
#returns a list with distances between touri and stations
def calculate_nearest(list_of_coords, tourist_coords):
distances = []
for metro_coords in list_of_coords:
distances.append(math.fabs(d.cityblock(metro_coords, tourist_coords)))
return distances
station_coords = []
station_names = []
input_stations = input("Input a number of stations: ").split()
input_stations = list(map(int, input_stations))
#all station coordinates and their names
station_M = input_stations[0]
#number of stations a tourist wants to visit
stations_wanted_N = input_stations[1]
#distribute the station names in station_names list
#and the coordinates in station_coords list
for data in range(0, station_M):
str_input = input()
list_input = str_input.split()
station_names.append(list_input[0])
list_input.pop(0)
list_input = list(map(float, list_input))
station_coords.append(list_input)
tourist_coordinates = input("Enter tourist position: ").split()
tourist_coordinates = list(map(float, tourist_coordinates))
distance_values = calculate_nearest(station_coords, tourist_coordinates)
result = find_shortest_N(distance_values, station_names, stations_wanted_N)
for name in result:
print(name, end=" ")

You could also, for example, directly use the cdist function:
import numpy as np
from scipy.spatial.distance import cdist
sample_input = '''
5 2
А 1 2
B 4.5 1.2
C 100500 100500
D 100501 100501
E 100502 100502
1 1
'''
# Parsing the input data:
sample_data = [line.split()
for line in sample_input.strip().split('\n')]
tourist_coords = np.array(sample_data.pop(), dtype=float) # takes the last line
nbr_stations, nbr_wanted = [int(n) for n in sample_data.pop(0)] # takes the first line
stations_coords = np.array([line[1:] for line in sample_data], dtype=float)
stations_names = [line[0] for line in sample_data]
# Computing the distances:
tourist_coords = tourist_coords.reshape(1, 2) # have to be a 2D array
distance = cdist(stations_coords, tourist_coords, metric='cityblock')
# Sorting the distances:
sorted_distance = sorted(zip(stations_names, distance), key=lambda x:x[1])
# Result:
result = [name for name, dist in sorted_distance[:nbr_wanted]]
print(result)

Use scipy.spatial.KDTree
from scipy.spatial import KDTree
subway_tree = KDTree(stations_coords)
dist, idx = subway_tree.query(tourist_coords, nbr_wanted, p = 1)
nearest_stations = station_names[idx]

Issue in passing an array to an index in Series object(TypeError: len() of unsized object)

I have a data as ndarray
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
Then I tried:
univals = set(a)
serObj=pd.Series()
for ele in univals:
indexfound=np.where(a == ele)
Xpointsfromindex=np.take(b, indexfound)
serobj1=pd.Series(Xpointsfromindex[0],index=ele) ##error happening here
serObj.apend(serobj1)
print(serObj)
I expect output to be like
0 ['x1','x3']
1 ['x2','x4']
2 ['x5','x6']
But it is giving me an error like "TypeError: len() of unsized object"
Where am I doing wrong?

I believe here is possible create DataFrame if same length of lists and then create lists with groupby:
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
df = pd.DataFrame({'a':a, 'b':b})
print(df)
a b
0 0 x1
1 1 x2
2 0 x3
3 1 x4
4 2 x5
5 2 x6
serObj = df.groupby('a')['b'].apply(list)
print (serObj)
a
0 [x1, x3]
1 [x2, x4]
2 [x5, x6]
Name: b, dtype: object

Just to stick to what OP was doing, here is the full code that works -
import pandas as pd
import numpy as np
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
univals = set(a)
serObj=pd.Series()
for ele in univals:
indexfound=np.where([i==ele for i in a])
Xpointsfromindex=np.take(b, indexfound)
print(Xpointsfromindex)
serobj1=pd.Series(Xpointsfromindex[0],index=[ele for _ in range(np.shape(indexfound)[1])]) ##error happening here
serObj.append(serobj1)
print(serObj)
Output
[['x1' 'x3']]
[['x2' 'x4']]
[['x5' 'x6']]
Explanation
indexfound=np.where(a == ele) will always return False because you are trying to compare a list with a scalar. Changing it to list comprehension fetches the indices
The next change is using list comprehension at the index parameter of the pd.Series.
This will set you on your way to what you want to achieve

Python Pandas: bootstrap confidence limits by row rather than entire dataframe

What I am trying to do is to get bootstrap confidence limits by row regardless of the number of rows and make a new dataframe from the output.I currently can do this for the entire dataframe, but not by row. The data I have in my actual program looks similar to what I have below:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I want the new dataframe to look something like this with the lower and upper confidence limits:
0 1
0 1 2
1 1 5.5
2 1 4.5
3 1 4.2
The current generated output looks like this:
0 1
0 2.0 2.75
The python 3 code below generates a mock dataframe and generates the bootstrap confidence limits for the entire dataframe. The result is a new dataframe with just 2 values, a upper and a lower confidence limit rather than 4 sets of 2(one for each row).
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a)
b = pd.DataFrame(b)
b = b.T
print(b)
Thank you for any help.

scikits.bootstrap operates by assuming that data samples are arranged by row, not by column. If you want the opposite behavior, just use the transpose, and a statfunction that doesn't combine columns.
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a.T, statfunction=lambda x: np.average(x, axis=0))
print(b.T)

Below is the answer I ended up figuring out to create bootstrap ci by row.
import pandas as pd
import numpy as np
import numpy.random as npr
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
x= zz.dtypes
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
def bootstrap(data, num_samples, statistic, alpha):
n = len(data)
idx = npr.randint(0, n, (num_samples, n))
samples = data[idx]
stat = np.sort(statistic(samples, 1))
return (stat[int((alpha/2.0)*num_samples)],
stat[int((1-alpha/2.0)*num_samples)])
cc = list(a.index.values) # informs generator of the number of rows
def bootbyrow(cc):
for xx in range(1):
xx = list(a.index.values)
for xx in range(len(cc)):
k = a.apply(lambda y: y[xx])
k = k.values
for xx in range(1):
kk = list(bootstrap(k,10000,np.mean,0.05))
yield list(kk)
abc = pd.DataFrame(list(bootbyrow(cc))) #bootstrap ci by row
# the next 4 just show that its working correctly
a0 = bootstrap((a.loc[0,].values),10000,np.mean,0.05)
a1 = bootstrap((a.loc[1,].values),10000,np.mean,0.05)
a2 = bootstrap((a.loc[2,].values),10000,np.mean,0.05)
a3 = bootstrap((a.loc[3,].values),10000,np.mean,0.05)
print(abc)
print(a0)
print(a1)
print(a2)
print(a3)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Pandas .describe() returns wrong column values in table - python-3.x

I'am not 100% sure i got your question correctly, but an issue might be, that you are not reassigning the output to new variable, therefore not saving it. Try to adjust your code in this matter: new_portfolio = new_portfolio.sort_values(by="Returns") Or turn inplace parameter to True - link

Related

How can I interpolate values from two lists (in Python)?

How to append a loop value to a list?

How can I improve this solution to make it faster using numpy?

Issue in passing an array to an index in Series object(TypeError: len() of unsized object)

Python Pandas: bootstrap confidence limits by row rather than entire dataframe

Categories

Resources