Statistical Analysis with mixed numbers

I am working on a survey for a small project that requires users to respond to some questions by selecting from a set of radio values, such as Strongly Disagree, Disagree, Neutral, Agree and Strongly Agree. The radio values for these selections are -2, -1, 0, 1, and 2, respectively. Finally, I need to perform some type of analysis on the data.
First, I used Python to try to normalize the values with a log transform (np.log1p):
import numpy as np

feel = [-1, -2, 0, 1, 2]
for i in feel:
    print(np.log1p(i))
The results are not favorable:
-inf
nan
0.0
0.6931471805599453
1.0986122886681098
<ipython-input-35-830bb9e2f96e>:3: RuntimeWarning: divide by zero encountered in log1p
  print(np.log1p(i))
<ipython-input-35-830bb9e2f96e>:3: RuntimeWarning: invalid value encountered in log1p
  print(np.log1p(i))
If I use C# to repeat the Log10 normalization:
List<double> origin = new List<double> { -1, -2, 0, 1, 2 };
Program p = new Program();
var norm = 0.0;
var denorm = 0.0;
foreach (var item in origin)
{
    System.Console.WriteLine($"Number: {item}");
    norm = p.normalize(item);
    System.Console.WriteLine($"Normalized: {norm}");
    denorm = p.denormalize(norm);
    System.Console.WriteLine($"Denormalized: {denorm}");
}

public double normalize(double value)
{
    return Math.Log10(value);
}

public double denormalize(double value)
{
    return Math.Round(Math.Pow(10, value), 14);
}
I get:
Number: -1
Normalized: NaN
Denormalized: NaN
Number: -2
Normalized: NaN
Denormalized: NaN
Number: 0
Normalized: -∞
Denormalized: 0
Number: 1
Normalized: 0
Denormalized: 1
Number: 2
Normalized: 0.3010299956639812
Denormalized: 2
Is there a way to encode the survey data so that the values stay finite when normalized, so that I can then run some analysis for an attitudinal approach?

The problem with using np.log10 is that the base-10 logarithm is undefined for non-positive numbers: 10^x = y has no real solution x when y <= 0. If you want or need to use that function in particular, you will need to add 3 to all of your options. That is, instead of going from -2 to 2, they should go from 1 to 5.
import numpy as np

feel = [1, 2, 3, 4, 5]
for i in feel:
    print(np.log10(i))
This outputs:
0.0
0.3010299956639812
0.47712125471966244
0.6020599913279624
0.6989700043360189
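
Putting that together, here is a minimal end-to-end sketch (the shift of 3 comes from the explanation above; the sample responses and the summary statistics are my own illustration):

import numpy as np

# raw Likert responses on the -2..2 scale (illustrative data)
responses = np.array([-2, -1, 0, 1, 2, 1, 0, -1, 2, 2])

# shift to 1..5 so every value is strictly positive, then log-transform
normalized = np.log10(responses + 3)

# simple attitudinal summary on the transformed scale
print(normalized.mean(), normalized.std())

# undo the transform to report a central value on the original scale
print(10 ** normalized.mean() - 3)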

Related

Pandas .describe() returns wrong column values in table

Look at the gld_weight column of Figure 1. It is throwing off completely wrong values. btc_weight + gld_weight should always add up to 1. But why does the gld_weight column not correspond to the returned row values when I use the describe function?
(Figures 1-3 are not reproduced here.)
This is my source code:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt

assets = ['BTC-USD', 'GLD']
mydata = pd.DataFrame()
for asset in assets:
    mydata[asset] = wb.DataReader(asset, data_source='yahoo', start='2015-1-1')['Close']

cleandata = mydata.dropna()
log_returns = np.log(cleandata / cleandata.shift(1))

annual_log_returns = log_returns.mean() * 252 * 100
annual_log_returns

annual_cov = log_returns.cov() * 252
annual_cov

pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []

for x in range(1000):
    weights = np.random.random(2)
    weights[0] = weights[0] / np.sum(weights)
    weights[1] = weights[1] / np.sum(weights)
    weights /= np.sum(weights)
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))

pfolio_returns
pfolio_volatility

npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)

new_portfolio = pd.DataFrame({
    'Returns': npfolio_returns,
    'Volatility': npfolio_volatility,
    'btc_weight': btc_weight,
    'gld_weight': gld_weight
})
I'm not 100% sure I understood your question correctly, but one issue might be that you are not reassigning the output to a new variable, and therefore not saving it.
Try to adjust your code in this manner:
new_portfolio = new_portfolio.sort_values(by="Returns")
Or set the inplace parameter to True.
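For instance, a one-line illustration of the in-place variant (same DataFrame as above):

# sorts new_portfolio by Returns without creating a new object
new_portfolio.sort_values(by="Returns", inplace=True)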
Short answer:
The issue was found in the for-loop where the initial weight values are normalized. For how it is fixed, see Update 1 below in this answer.
Background to getting the solution:
At first glance the OP's code seemed to be in order, and the values in the arrays were fitted as expected by the requests the OP made in the written code. From testing, it appeared that range(1000) was asking for trouble: oversight of the value outcomes was lost in the vast amount of "randomness" in the results. Especially since the question was written up as a transformation issue, an x/y-axis mix-up or some other kind of transformation error was hard to study.
To tackle this I used static values, as can be seen for annual_log_returns and annual_cov, and a small range.
Then I printed all intermediate outputs, so the values become locked in place and can't be changed further down the processing; it was possible that the printed values changed during run-time because the arrays were not locked (as also suggested by Pavel Klammert in his answer).
After the commented feedback I figured out what the OP meant by "the values are wrong", and focused on how the values used to fill the arrays were created.
The issue of "throwing wrong values" was found:
The line weights[0] = weights[0]/np.sum(weights) replaces the original weights[0] with a new value, which then serves as input to weights[1] = weights[1]/np.sum(weights); because np.sum(weights) has changed in between, the pair never reaches sum = 1.
The values weights[0] and weights[1] are therefore assigned to two new variables, a and b, directly after weights is created, to prevent overwriting the initial values. Then the outcome is as planned.
Problem solved.
import numpy as np
import pandas as pd

pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []

# static values (see the background notes above)
annual_log_returns = [0.69, 0.71]
annual_cov = 0.73

ranger = 5
for x in range(ranger):
    weights = np.random.random(2)
    weights[0] = weights[0] / np.sum(weights)
    weights[1] = weights[1] / np.sum(weights)
    weights /= np.sum(weights)
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))

print(weights[0])
print(weights[1])
print(weights)

#print(pfolio_returns)
#print(pfolio_volatility)

npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)

#df = pd.DataFrame(array, index=row_names, columns=column_names, dtype=dtype)
new_portfolio = pd.DataFrame({'Returns': npfolio_returns,
                              'Volatility': npfolio_volatility,
                              'btc_weight': btc_weight,
                              'gld_weight': gld_weight})
print(new_portfolio, '\n')

sort = new_portfolio.sort_values(by='Returns')

sort_max_gld_weight = sort.loc[ranger - 1, 'gld_weight']
print('Sort:\n', sort, '\n')
print('sort max_gld_weight : "%s"\n' % sort_max_gld_weight)  # only correct if row ranger-1 holds the highest gld_weight... but in most cases it does not!

sort_max_gld_weight = sort.max(axis=0)[3]  # this returns the column 4 'gld_weight' value
print('sort max_gld_weight : "%s"\n' % sort_max_gld_weight)

desc = new_portfolio.describe()
desc_max_gld_weight = desc.loc['max', 'gld_weight']
print('Describe:\n', desc, '\n')
print('desc max_gld_weight : "%s"\n' % desc_max_gld_weight)

max_val_gld = new_portfolio.loc[new_portfolio['gld_weight'] == sort_max_gld_weight]
print('max val gld:\n', max_val_gld, '\n')

locations = new_portfolio.loc[new_portfolio['gld_weight'] > 0.99]
print('location:\n', locations)
Result can be for example:
0.9779586087178525
0.02204139128214753
[0.97795861 0.02204139]
Returns Volatility btc_weight gld_weight
0 0.702820 0.627707 0.359024 0.640976
1 0.709807 0.846179 0.009670 0.990330
2 0.708724 0.801756 0.063786 0.936214
3 0.702010 0.616237 0.399496 0.600504
4 0.690441 0.835780 0.977959 0.022041
Sort:
Returns Volatility btc_weight gld_weight
4 0.690441 0.835780 0.977959 0.022041
3 0.702010 0.616237 0.399496 0.600504
0 0.702820 0.627707 0.359024 0.640976
2 0.708724 0.801756 0.063786 0.936214
1 0.709807 0.846179 0.009670 0.990330
sort max_gld_weight : "0.02204139128214753"
sort max_gld_weight : "0.9903300366638084"
Describe:
Returns Volatility btc_weight gld_weight
count 5.000000 5.000000 5.000000 5.000000
mean 0.702760 0.745532 0.361987 0.638013
std 0.007706 0.114057 0.385321 0.385321
min 0.690441 0.616237 0.009670 0.022041
25% 0.702010 0.627707 0.063786 0.600504
50% 0.702820 0.801756 0.359024 0.640976
75% 0.708724 0.835780 0.399496 0.936214
max 0.709807 0.846179 0.977959 0.990330
desc max_gld_weight : "0.9903300366638084"
max val gld:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
location:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
Update 1:
for x in range(ranger):
    weights = np.random.random(2)
    print(weights)
    a = weights[0] / np.sum(weights)  # 'a' is computed while weights is still unmodified
    print(weights[0])
    b = weights[1] / np.sum(weights)  # 'b' uses the same, unmodified sum
    print(weights[1])
    print('w0 + w1=', weights[0] + weights[1])
    weights /= np.sum(weights)
    btc_weight.append(a)
    gld_weight.append(b)
    print('a=', a, 'b=', b, 'a+b=', a + b)
The new output becomes for example:
[0.37710183 0.72933416]
0.3771018292953062
0.7293341569809412
w0 + w1= 1.1064359862762474
a= 0.34082570882790686 b= 0.6591742911720931 a+b= 1.0
[0.09301326 0.05296838]
0.09301326441107827
0.05296838430180717
w0 + w1= 0.14598164871288544
a= 0.637157240181712 b= 0.3628427598182879 a+b= 1.0
[0.48501305 0.56078073]
0.48501305100305336
0.5607807281299131
w0 + w1= 1.0457937791329663
a= 0.46377503928658087 b= 0.5362249607134192 a+b= 1.0
[0.41271663 0.89734662]
0.4127166254704412
0.8973466186511199
w0 + w1= 1.3100632441215612
a= 0.31503564986069105 b= 0.6849643501393089 a+b= 1.0
[0.11854074 0.57862593]
0.11854073835784273
0.5786259314340823
w0 + w1= 0.697166669791925
a= 0.1700321364950252 b= 0.8299678635049749 a+b= 1.0
Results printed outside the for-loop:
0.1700321364950252
0.8299678635049749
[0.17003214 0.82996786]
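
Applied back to the original portfolio code, a minimal sketch of the fixed loop (keeping the OP's variable names; everything else in the setup stays as posted):

for x in range(1000):
    weights = np.random.random(2)
    # normalize both weights against the same, unmodified sum
    a = weights[0] / np.sum(weights)
    b = weights[1] / np.sum(weights)
    btc_weight.append(a)
    gld_weight.append(b)
    weights = np.array([a, b])  # a + b == 1 by construction
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))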

How to have an algorithm round in a "smart" way automatically

I would like to round numbers in my code, but in a way that adapts to each value.
For example, I would like a rounding algorithm to return:
for 0.999999 it should return 1
for 0.0749999 it should return 0.075
for 0.006599 it should return 0.0066
and so on ...
I don't know the number of digits in advance (which is essentially my problem).
I was thinking of using strings to find where the 9s are (or to count the 0s), but that seems like a lot of effort for such a task.
If you know any way to do that (if possible without advanced libraries), I would appreciate it.
Thanks.
It's somewhat complicated, but it works. Please make sure the result is what you want; I think you can understand how the number is rounded from the code.
def clear9(numstr):
    liststr = list(numstr)
    # walk from the last character towards the first
    for index in range(len(liststr) - 1, -1, -1):
        if liststr[index] == '.':
            continue
        if liststr[index] == '9':
            # a trailing 9 becomes 0 and the carry moves one digit left
            liststr[index] = '0'
            if index == 0:
                liststr.insert(0, '1')
        else:
            # the first non-9 digit absorbs the carry (unless no 9 was seen)
            if index != len(liststr) - 1:
                liststr[index] = str(int(liststr[index]) + 1)
            break
    return ''.join(liststr)

def myround(num):
    return float(clear9(str(num)))

print(myround(9.05))
print(myround(9.999999))
print(myround(0.999999))
print(myround(0.0749999))
print(myround(0.006599))
print(myround(0.00659923))
print(myround(0.09659923))
print(myround(-0.00659923))
9.05
10.0
1.0
0.075
0.0066
0.00659923
0.09659923
-0.00659923
import math

def round_(number):
    dist = int(math.log10(abs(number)))  # number of zeros after the decimal point
    return round(number, abs(dist) + 2) if dist != 0 else round(number)

print(round_(0.999999))
print(round_(0.0749999))
print(round_(0.006599))
print(round_(-0.00043565))
output:
1
0.075
0.0066
-0.00044
Dealing with floating-point numbers is tricky. You want to do a kind of round-off in base 10, but floating-point numbers are base 2.
So I propose to use the decimal module, which works in base 10, as opposed to base-2 floating point:
from decimal import Decimal

def myround(num):
    dec = Decimal(num)
    adj = abs(dec.adjusted()) + 1
    return round(num, adj)
Look at the documentation for Decimal.adjusted() to understand how this works.
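For intuition, here is a quick illustration of what adjusted() returns (the example values are my own):

from decimal import Decimal

# adjusted() gives the exponent of the most significant digit
print(Decimal("0.999999").adjusted())   # -1  -> round to abs(-1)+1 = 2 places
print(Decimal("0.0749999").adjusted())  # -2  -> round to 3 places
print(Decimal("0.006599").adjusted())   # -3  -> round to 4 places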
A test:
In [1]: from decimal import Decimal
In [2]: def myround(num):
...: dec = Decimal(num)
...: adj = abs(dec.adjusted())+1
...: return round(num, adj)
...:
In [3]: myround(0.999999)
Out[3]: 1.0
In [4]: myround(0.006599)
Out[4]: 0.0066
In [5]: myround(0.0749999)
Out[5]: 0.075

How to return floating values using floor division

In Python 3, I want to return the units place of an integer value, then the tens, then the hundreds, and so on. Suppose I have the integer 456: first I want to return 6, then 5, then 4. Is there any way? I tried floor division and a for loop, but it didn't work.
If you look at the list of basic operators in the documentation, for example here:
% (modulus): divides the left-hand operand by the right-hand operand and returns the remainder, e.g. b % a = 1.
// (floor division): the division of operands where the digits after the decimal point are removed. But if one of the operands is negative, the result is floored, i.e. rounded towards negative infinity: 9 // 2 = 4 and 9.0 // 2.0 = 4.0, while -11 // 3 = -4 and -11.0 // 3 = -4.0.
With that knowledge, you can get what you want as follows:
In [1]: a = 456
In [2]: a % 10
Out[2]: 6
In [3]: (a % 100) // 10
Out[3]: 5
In [4]: a // 100
Out[4]: 4
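
To generalize this to any number of digits, a small sketch using divmod (my own illustration, not from the answer above):

a = 456
while a:
    a, digit = divmod(a, 10)  # quotient and remainder in one step
    print(digit)
# prints 6, then 5, then 4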
Write a generator if you want to retrieve the digits at different places in your code based on your requirements, as follows.
If you are not very familiar with Python's generators, have a quick look at https://www.programiz.com/python-programming/generator.
» Here get_digits() is a generator.
def get_digits(n):
    # yield digits from least to most significant
    while True:
        yield n % 10
        n = n // 10
        if not n:
            break
digit = get_digits(1729)
print(next(digit)) # 9
print(next(digit)) # 2
print(next(digit)) # 7
print(next(digit)) # 1
» If you wish to iterate over digits, you can also do so as follows.
for digit in get_digits(74831965):
    print(digit)
# 5
# 6
# 9
# 1
# 3
# 8
# 4
# 7
» Quick overview about its usage (On Python3's Interactive terminal).
>>> def letter(name):
...     for ch in name:
...         yield ch
...
>>>
>>> char = letter("RISHIKESH")
>>>
>>> next(char)
'R'
>>>
>>> "Second letter is my name is: " + next(char)
'Second letter is my name is: I'
>>>
>>> "3rd one: " + next(char)
'3rd one: S'
>>>
>>> next(char)
'H'
>>>

Python Pandas: bootstrap confidence limits by row rather than entire dataframe

What I am trying to do is to get bootstrap confidence limits by row, regardless of the number of rows, and make a new dataframe from the output. I can currently do this for the entire dataframe, but not by row. The data I have in my actual program looks similar to what I have below:
   0  1  2
0  1  2  3
1  4  1  4
2  1  2  3
3  4  1  4
I want the new dataframe to look something like this, with the lower and upper confidence limits:
   0    1
0  1  2.0
1  1  5.5
2  1  4.5
3  1  4.2
The current generated output looks like this:
     0     1
0  2.0  2.75
The Python 3 code below generates a mock dataframe and generates the bootstrap confidence limits for the entire dataframe. The result is a new dataframe with just 2 values, an upper and a lower confidence limit, rather than 4 sets of 2 (one for each row).
import pandas as pd
import numpy as np
import scikits.bootstrap as sci

zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
                   [[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)

x = zz.dtypes
print(x)

a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0], zz.index, zz.columns)
print(a)

b = sci.ci(a)
b = pd.DataFrame(b)
b = b.T
print(b)
Thank you for any help.
scikits.bootstrap operates by assuming that data samples are arranged by row, not by column. If you want the opposite behavior, just use the transpose, and a statfunction that doesn't combine columns.
import pandas as pd
import numpy as np
import scikits.bootstrap as sci

zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
                   [[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)

x = zz.dtypes
print(x)

a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0], zz.index, zz.columns)
print(a)

b = sci.ci(a.T, statfunction=lambda x: np.average(x, axis=0))
print(b.T)
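
As a quick sanity check (my own addition; it assumes scikits.bootstrap's default statfunction, np.average, applied to a 1-D sample), the interval for a single row should roughly match the corresponding pair in b.T:

# confidence limits for row 0 alone; bootstrap resampling is random,
# so expect values close to, not identical to, the first row of b.T
b0 = sci.ci(a.iloc[0].values)
print(b0)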
Below is the answer I ended up figuring out, to create the bootstrap confidence intervals by row.
import pandas as pd
import numpy as np
import numpy.random as npr

zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
                   [[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
x = zz.dtypes
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0], zz.index, zz.columns)
print(a)

def bootstrap(data, num_samples, statistic, alpha):
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha / 2.0) * num_samples)],
            stat[int((1 - alpha / 2.0) * num_samples)])

cc = list(a.index.values)  # informs the generator of the number of rows

def bootbyrow(cc):
    for xx in range(len(cc)):
        k = a.apply(lambda y: y[xx]).values  # the values of row xx
        yield list(bootstrap(k, 10000, np.mean, 0.05))

abc = pd.DataFrame(list(bootbyrow(cc)))  # bootstrap ci by row

# the next 4 lines just show that it's working correctly
a0 = bootstrap(a.loc[0, ].values, 10000, np.mean, 0.05)
a1 = bootstrap(a.loc[1, ].values, 10000, np.mean, 0.05)
a2 = bootstrap(a.loc[2, ].values, 10000, np.mean, 0.05)
a3 = bootstrap(a.loc[3, ].values, 10000, np.mean, 0.05)

print(abc)
print(a0)
print(a1)
print(a2)
print(a3)
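
For completeness, a fully vectorized variant is also possible. Below is a minimal sketch (my own; it reuses the percentile-index convention of the bootstrap function above and resamples every row at once):

def bootstrap_by_row(values, num_samples, alpha):
    # values: 2-D array with one data sample per row
    n_rows, n_cols = values.shape
    idx = npr.randint(0, n_cols, (num_samples, n_cols))
    resampled = values[:, idx]  # shape (n_rows, num_samples, n_cols)
    stats = np.sort(np.mean(resampled, axis=2), axis=1)
    lo = stats[:, int((alpha / 2.0) * num_samples)]
    hi = stats[:, int((1 - alpha / 2.0) * num_samples)]
    return pd.DataFrame({0: lo, 1: hi})

print(bootstrap_by_row(a.values, 10000, 0.05))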

Missing value imputation in Python

I have two huge vectors, item_clusters and beta. The element item_clusters[i] is the cluster id to which item i belongs. The element beta[i] is a score given to item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted item_clusters to a matrix clusters_to_items such that the element clusters_to_items[i][j] = 1 if cluster i contains item j, else 0. After that I am running the following code.
# beta (1 x 1.3M) csr_matrix
# num_clusters = 1000
# item_clusters (1 x 1.3M) numpy.array
# clust_to_items (1000 x 1.3M) csr_matrix

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = len(alpha_beta.nonzero()[1]) + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # interact_score is the required modified beta
    # and is used to do some work that is very fast
The problem is that this code has to run 150K times, and it is very slow. It will take 12 days to run, according to my estimate.
Edit: I believe I need some very different idea, in which I can use item_clusters directly and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
beta cluster
0 0 506
1 -1 968
2 -1 873
3 1 179
4 3 269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
beta cluster
0 1.27582 506
1 -1.00000 968
2 -1.00000 873
3 1.00000 179
4 3.00000 269
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first element of the row sums, so you're doing a couple of million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
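To make that concrete, here is a minimal sketch of computing the per-cluster sums directly from item_clusters instead of building alpha_beta at all (np.bincount is my own substitution; the variable names follow the question):

import numpy as np

# dense view of the scores; beta is assumed to be the 1 x N csr_matrix from the question
scores = np.asarray(beta.todense()).ravel()

# per-cluster sum of scores and count of non-zero scores, one pass each
cluster_sums = np.bincount(item_clusters, weights=scores, minlength=num_clusters)
cluster_nnz = np.bincount(item_clusters, weights=(scores != 0), minlength=num_clusters)

# same smoothing constant as the question's num_nonzero
to_impute = cluster_sums / (cluster_nnz + 0.001)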
