Missing value imputation in Python - python-3.x

I have two huge vectors, item_clusters and beta. The element item_clusters[i] is the cluster id to which item i belongs. The element beta[i] is a score given to item i. Scores are in {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted item_clusters to a matrix clust_to_items such that the element clust_to_items[i][j] = 1 if cluster i contains item j, else 0. After that I am running the following code.
# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix
alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # The interact_score is the required modified beta
    # This is used to do some work that is very fast
The problem is that this code has to run 150K times, and it is very slow. By my estimate it will take 12 days to run.
Edit: I believe I need a quite different approach in which I can use item_clusters directly and do not have to iterate over each cluster separately.

I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
beta cluster
0 0 506
1 -1 968
2 -1 873
3 1 179
4 3 269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
beta cluster
0 1.27582 506
1 -1.00000 968
2 -1.00000 873
3 1.00000 179
4 3.00000 269

My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first element of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
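To illustrate that direction, here is a rough sketch (my own assumption, untested against the original data) of how the per-cluster sums and non-zero counts could be computed with two sparse products instead of the per-cluster loop, using the clust_to_items and beta matrices from the question:
import numpy as np

beta = beta.copy()
beta.eliminate_zeros()  # assume explicitly stored zeros should not count as scores

# per-cluster sum of scores: (1000 x 1.3M) @ (1.3M x 1) -> (1000,)
cluster_sums = clust_to_items.dot(beta.T).toarray().ravel()

# per-cluster count of non-zero scores: same product against an indicator of beta's support
indicator = beta.copy()
indicator.data = np.ones_like(indicator.data)
cluster_nnz = clust_to_items.dot(indicator.T).toarray().ravel()

# average non-zero score per cluster (guarding against clusters with no non-zero scores)
cluster_means = cluster_sums / np.maximum(cluster_nnz, 1)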

Related

Numpy.apply_along_axis works unexpectedly when applying a function with an if-else condition

I am getting unexpected results. The following code reproduces the problem in the simplest way (f is just a test function):
import numpy as np

# returns absolute difference between last and first element in an array
def f(arr):
    return 0 if arr[-1] == arr[0] else abs(arr[-1] - arr[0])

def test_vectorized(test_arr, window=2):
    T = test_arr.shape[0]
    # create sliding windows
    slide_windows = np.expand_dims(np.arange(window+1), axis=0) + np.expand_dims(np.arange(T - window), axis=0).T
    print(slide_windows)
    slide_values = test_arr[slide_windows]
    print(slide_values)
    # apply function to each sliding window
    return np.apply_along_axis(f, axis=1, arr=slide_values)

# testing
test_arr = np.array([27.75, 27.71, 28.05, 27.75, 26.55, 27.18])
test_vectorized(test_arr, window=3)
#Output
[[0 1 2 3]
[1 2 3 4]
[2 3 4 5]]
[[27.75 27.71 28.05 27.75]
[27.71 28.05 27.75 26.55]
[28.05 27.75 26.55 27.18]]
Out[238]:
array([0, 1, 0])
The code should return array([0, 1.16, 0.87]), i.e. the absolute difference between the first and last element in each of the sliding windows.
I'm using a Jupyter notebook with Python 3.8.2. I've spent more than an hour debugging, but it seems like there's no problem with the code itself. Could anyone help? Highly appreciated.
Your function f returns an integer (the literal 0) for the first window, and np.apply_along_axis infers the output dtype from that first result, so the later float results get truncated to integers.
You have to make f return floats:
def f(arr):
    return float(0 if arr[-1] == arr[0] else abs(arr[-1] - arr[0]))
[[0 1 2 3]
[1 2 3 4]
[2 3 4 5]]
[[27.75 27.71 28.05 27.75]
[27.71 28.05 27.75 26.55]
[28.05 27.75 26.55 27.18]]
[0. 1.16 0.87]
P.S. Your function f can be simplified to just return abs(arr[-1] - arr[0]), since that already covers the 0 case; you don't need the if statement.
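Building on that, a hedged sketch of a fully vectorized variant (test_vectorized_fast is just an illustrative name, not from the original post): once slide_values is built, this particular f does not need np.apply_along_axis at all, which also sidesteps the dtype problem entirely.
import numpy as np

def test_vectorized_fast(test_arr, window=2):
    T = test_arr.shape[0]
    # same sliding-window construction as in the question
    slide_windows = np.expand_dims(np.arange(window + 1), axis=0) + np.expand_dims(np.arange(T - window), axis=0).T
    slide_values = test_arr[slide_windows]
    # absolute difference between the last and first element of each window
    return np.abs(slide_values[:, -1] - slide_values[:, 0])

test_arr = np.array([27.75, 27.71, 28.05, 27.75, 26.55, 27.18])
print(test_vectorized_fast(test_arr, window=3))  # [0.   1.16 0.87]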

Identify similar numbers from several lists

I have 3 lists:
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
I want to calculate the average of the most similar numbers. In the example above, r[0], g[1] and b[1] are very similar (approximately 0.61...). How can I identify this kind of pattern?
Brute force using list comprehensions:
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
rg = [(idx_r, idx_g, r, g) if abs(rr - gg) < 0.001 else None
      for idx_r, rr in enumerate(r)
      for idx_g, gg in enumerate(g)]
rb = [(idx_r, idx_b, r, b) if abs(rr - bb) < 0.001 else None
      for idx_r, rr in enumerate(r)
      for idx_b, bb in enumerate(b)]
gb = [(idx_g, idx_b, g, b) if abs(gg - bb) < 0.001 else None
      for idx_g, gg in enumerate(g)
      for idx_b, bb in enumerate(b)]
print(list(filter(None, rg + rb + gb)))
Output:
[(0, 1, [0.611695403733703, 0.833193902333201, 1.09120811998494],
[0.300675698437847, 0.612539072191236, 1.18046695352626]),
(0, 1, [0.611695403733703, 0.833193902333201, 1.09120811998494],
[0.00668849762984564, 0.611946522017357, 1.16778502636141]),
(1, 1, [0.300675698437847, 0.612539072191236, 1.18046695352626],
[0.00668849762984564, 0.611946522017357, 1.16778502636141])]
The output is a list of tuples: the index in the first list, the index in the second list, and the two lists themselves.
You are looking to compute the distance between all pairs of points. The best way to do this is scipy.spatial.distance.cdist:
from scipy.spatial.distance import cdist
import numpy as np
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
arr = np.array([r,g,b])
# need 2d set of points
arr_flat = arr.ravel()[:, np.newaxis]
# computes distance between every point, pairwise
dists = cdist(arr_flat, arr_flat)
# (1,2) is the same as (2,1), so only consider each pair once
# ie. use upper triangle
dists = np.triu(dists)
# set 0 values to inf so we don't consider them
dists[dists == 0] = np.inf
# get all pairs that are below this threshold level
thold = 0.01
coords = np.nonzero(dists < thold)
labels = 'rgb'
print(f'Pairs of points closer than {thold}:')
for i, j in zip(*coords):
    print(labels[i//3] + f'[{i%3}]', labels[j//3] + f'[{j%3}]')
>>> Pairs of points closer than 0.01:
r[0] g[1]
r[0] b[1]
g[1] b[1]
# can easily count the number of points as
np.count_nonzero(dists<thold)
>>> 3
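If the end goal from the question is the average of each group of similar numbers, here is a small follow-up sketch (assuming the pairs found above are the ones of interest; arr_flat, coords, labels and thold come from the snippet above):
# average the two values of each close pair found above
for i, j in zip(*coords):
    pair_mean = (arr_flat[i, 0] + arr_flat[j, 0]) / 2
    print(labels[i // 3] + f'[{i % 3}]', labels[j // 3] + f'[{j % 3}]', '-> mean:', round(pair_mean, 6))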

The motifs function in the module matrixprofile.motifs doesn't return the exact count given by max_motifs in motifs(ts, mp, max_motifs=10)

Here I use stumpy.stumped to find the matrix profile. This is my time series (the entire data):
Entire_data= spark.sql('SELECT Closw from rk02_eurusd_candlestick_1_m_bid_01_01_2018_31_12_2018_csv where Closw is not null order by Gmt_time').toPandas()
ECG_data= Entire_data['Close']
ECG_data_values=ECG_data.values
print('The ecg data values are \n',ECG_data_values)
output:
The ecg data values are
[1.19985 1.19985 1.19985 ... 1.14627 1.14627 1.14627]
The time series holds minute-level data for the year 2018. I calculate the matrix profile for this data with a 60-minute window using stumpy.stumped:
start = time.time()
mp =stumpy.stumped(dask_client,ECG_data_values,60)
end = time.time()
print('mp is \n',mp)
print('timetaken for stumpy \n', end - start)
Output:
mp is
[[0.05810642702033023 504378 -1 504378]
[0.04393523046509543 345806 -1 345806]
[0.07967330055954358 504378 -1 504378]
...
[8.9131475674334 441593 441593 -1]
[8.9131475674334 441594 441594 -1]
[8.9131475674334 441595 441595 -1]]
timetaken for stumpy
2083.619256258011
mparr = np.array(mp)
print(mparr)
mparr1 = ((mparr[:, 0], mparr[:, 1]))
print(mparr1)
Output:
(array([0.05810642702033023, 0.04393523046509543, 0.07967330055954358, ...,
8.9131475674334, 8.9131475674334, 8.9131475674334], dtype=object), array([504378, 345806, 504378, ..., 441593, 441594, 441595], dtype=object))
This is where I run into the problem. I'm trying to find 10 motifs:
motifs, motif_distance = motifs.motifs(ECG_data_values, mparr1, max_motifs=10)
print('top motifs: \n',motifs)
I expect 10 motifs, i.e. 10 lists, but it returns only 4 lists (motifs). Is this correct? Have I made a mistake?
Output:
[[356, 342497], [8, 39, 410,.......525115, 525178, 525209, 525301, 525332, 525366, 525397], [201362, 332401], [9901, 40141, 120721, 161041, 382802, 413041, 463501, 503821, 516782]]
Please help me with this. Thanks in advance!

Pandas .describe() returns wrong column values in table

Look at the gld_weight column of Figure 1: it is showing completely wrong values. btc_weight + gld_weight should always add up to 1, but why does the gld_weight column not correspond to the row values returned when I use the describe function?
Figure 1, Figure 2, Figure 3: (screenshots not included)
This is my source code:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
assets = ['BTC-USD', 'GLD']
mydata = pd.DataFrame()
for asset in assets:
    mydata[asset] = wb.DataReader(asset, data_source='yahoo', start='2015-1-1')['Close']
cleandata = mydata.dropna()
log_returns = np.log(cleandata/cleandata.shift(1))
annual_log_returns = log_returns.mean() * 252 * 100
annual_log_returns
annual_cov = log_returns.cov() * 252
annual_cov
pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []
for x in range(1000):
    weights = np.random.random(2)
    weights[0] = weights[0]/np.sum(weights)
    weights[1] = weights[1]/np.sum(weights)
    weights /= np.sum(weights)
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))
pfolio_returns
pfolio_volatility
npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)
new_portfolio = pd.DataFrame({
    'Returns': npfolio_returns,
    'Volatility': npfolio_volatility,
    'btc_weight': btc_weight,
    'gld_weight': gld_weight
})
I'm not 100% sure I understood your question correctly, but one issue might be that you are not reassigning the output to a new variable, and are therefore not saving it.
Try to adjust your code in this matter:
new_portfolio = new_portfolio.sort_values(by="Returns")
Or set the inplace parameter to True - link
Short answer:
The issue was found in the for-loop where the initial weight values are normalized. How it is fixed: see Update 1 below in the answer.
Background to getting the solution:
At first glance the OP's code seemed to be in order and the arrays were filled as the code requested. From testing, it appeared that range(1000) was asking for trouble, because the sheer amount of random results made it impossible to keep an overview of the values. Since the question was framed as a transformation issue, mixing of x/y axis values or some other kind of transformation error was hard to study.
To tackle this I used static values, as can be seen for annual_log_returns and annual_cov.
Then I printed all intermediate outputs so the values are locked in place and cannot change further down the processing; it was possible that the printed values changed during run-time because the arrays were not locked (as also suggested by Pavel Klammert in his answer).
After feedback in the comments I figured out what the OP meant by "the values are wrong". I then focused on how the values used to fill the arrays were created.
The cause of the "wrong values" was found:
The line weights[0] = weights[0]/np.sum(weights) replaces the original weights[0] value with a new one, which then serves as input for weights[1] = weights[1]/np.sum(weights), so a sum of 1 is never reached.
The results of the two normalizations were therefore assigned to new variables a and b, directly after weights[0] and weights[1] are created, to avoid overwriting the initial weights values. Then the outcome is as planned.
Problem solved.
import numpy as np
import pandas as pd
pfolio_returns = []
pfolio_volatility = []
btc_weight = []
gld_weight = []
annual_log_returns = [0.69, 0.71]
annual_cov = 0.73
ranger = 5
for x in range(ranger):
    weights = np.random.random(2)
    weights[0] = weights[0]/np.sum(weights)
    weights[1] = weights[1]/np.sum(weights)
    weights /= np.sum(weights)
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])
    pfolio_returns.append(np.dot(annual_log_returns, weights))
    pfolio_volatility.append(np.sqrt(np.dot(weights.T, np.dot(annual_cov, weights))))
print (weights[0])
print (weights[1])
print (weights)
#print (pfolio_returns)
#print (pfolio_volatility)
npfolio_returns = np.array(pfolio_returns)
npfolio_volatility = np.array(pfolio_volatility)
#df = pd.DataFrame(array, index=row_names, columns=column_names, dtype=dtype)
new_portfolio = pd.DataFrame({'Returns': npfolio_returns, 'Volatility': npfolio_volatility, 'btc_weight': btc_weight, 'gld_weight': gld_weight})
print (new_portfolio, '\n')
sort = new_portfolio.sort_values(by='Returns')
sort_max_gld_weight = sort.loc[ranger-1, 'gld_weight']
print ('Sort:\n', sort, '\n')
print ('sort max_gld_weight : "%s"\n' % sort_max_gld_weight) # if the last row contains the highest gld_weight... but in most cases it does not!
sort_max_gld_weight = sort.max(axis=0)[3] # this returns column 4, the 'gld_weight' value.
print ('sort max_gld_weight : "%s"\n' % sort_max_gld_weight) # this returns column 4, the 'gld_weight' value.
desc = new_portfolio.describe()
desc_max_gld_weight =desc.loc['max', 'gld_weight']
print ('Describe:\n', desc, '\n')
print ('desc max_gld_weight : "%s"\n' % desc_max_gld_weight)
max_val_gld = new_portfolio.loc[new_portfolio['gld_weight'] == sort_max_gld_weight]
print('max val gld:\n', max_val_gld, '\n')
locations = new_portfolio.loc[new_portfolio['gld_weight'] > 0.99]
print ('location:\n', locations)
Result can be for example:
0.9779586087178525
0.02204139128214753
[0.97795861 0.02204139]
Returns Volatility btc_weight gld_weight
0 0.702820 0.627707 0.359024 0.640976
1 0.709807 0.846179 0.009670 0.990330
2 0.708724 0.801756 0.063786 0.936214
3 0.702010 0.616237 0.399496 0.600504
4 0.690441 0.835780 0.977959 0.022041
Sort:
Returns Volatility btc_weight gld_weight
4 0.690441 0.835780 0.977959 0.022041
3 0.702010 0.616237 0.399496 0.600504
0 0.702820 0.627707 0.359024 0.640976
2 0.708724 0.801756 0.063786 0.936214
1 0.709807 0.846179 0.009670 0.990330
sort max_gld_weight : "0.02204139128214753"
sort max_gld_weight : "0.9903300366638084"
Describe:
Returns Volatility btc_weight gld_weight
count 5.000000 5.000000 5.000000 5.000000
mean 0.702760 0.745532 0.361987 0.638013
std 0.007706 0.114057 0.385321 0.385321
min 0.690441 0.616237 0.009670 0.022041
25% 0.702010 0.627707 0.063786 0.600504
50% 0.702820 0.801756 0.359024 0.640976
75% 0.708724 0.835780 0.399496 0.936214
max 0.709807 0.846179 0.977959 0.990330
desc max_gld_weight : "0.9903300366638084"
max val gld:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
location:
Returns Volatility btc_weight gld_weight
1 0.709807 0.846179 0.00967 0.99033
Update 1 :
for x in range(ranger):
    weights = np.random.random(2)
    print (weights)
    a = weights[0]/np.sum(weights) # see comments below.
    print (weights[0])
    b = weights[1]/np.sum(weights) # see comments below.
    print (weights[1])
    print ('w0 + w1=', weights[0] + weights[1])
    weights /= np.sum(weights)
    btc_weight.append(a)
    gld_weight.append(b)
    print('a=', a, 'b=', b, 'a+b=', a + b)
The new output becomes for example:
[0.37710183 0.72933416]
0.3771018292953062
0.7293341569809412
w0 + w1= 1.1064359862762474
a= 0.34082570882790686 b= 0.6591742911720931 a+b= 1.0
[0.09301326 0.05296838]
0.09301326441107827
0.05296838430180717
w0 + w1= 0.14598164871288544
a= 0.637157240181712 b= 0.3628427598182879 a+b= 1.0
[0.48501305 0.56078073]
0.48501305100305336
0.5607807281299131
w0 + w1= 1.0457937791329663
a= 0.46377503928658087 b= 0.5362249607134192 a+b= 1.0
[0.41271663 0.89734662]
0.4127166254704412
0.8973466186511199
w0 + w1= 1.3100632441215612
a= 0.31503564986069105 b= 0.6849643501393089 a+b= 1.0
[0.11854074 0.57862593]
0.11854073835784273
0.5786259314340823
w0 + w1= 0.697166669791925
a= 0.1700321364950252 b= 0.8299678635049749 a+b= 1.0
Results printed outside the for-loop:
0.1700321364950252
0.8299678635049749
[0.17003214 0.82996786]
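For completeness, a minimal sketch of an equivalent fix without the extra a and b variables (my own simplification, not part of the original answer): normalize the random pair exactly once, before anything is overwritten, so the appended weights always sum to 1.
import numpy as np

btc_weight, gld_weight = [], []
for _ in range(1000):
    weights = np.random.random(2)
    weights /= np.sum(weights)  # single normalization: weights[0] + weights[1] == 1
    btc_weight.append(weights[0])
    gld_weight.append(weights[1])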

How do I set the minimum and maximum length of dataframes in hypothesis?

I have the following strategy for creating dataframes with genomics data:
from hypothesis.extra.pandas import columns, data_frames, column
import hypothesis.strategies as st
def mysort(tp):
    key = [-1, tp[1], tp[2], int(1e10)]
    return [x for _, x in sorted(zip(key, tp))]
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
chromosomes = st.sampled_from(elements=["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])
genomics_data = data_frames(columns=columns(["Chromosome", "Start", "End", "Strand"], dtype=int),
                            rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))
I am not really interested in empty dataframes as they are invalid, and I would also like to produce some really long dfs. How do I change the sizes of the dataframes created for test cases? I.e. min size 1, avg size large?
You can give the data_frames constructor an index argument which has min_size and max_size options:
from hypothesis.extra.pandas import data_frames, columns, range_indexes
import hypothesis.strategies as st
def mysort(tp):
    key = [-1, tp[1], tp[2], int(1e10)]
    return [x for _, x in sorted(zip(key, tp))]
chromosomes = st.sampled_from(["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
dfs = data_frames(index=range_indexes(min_size=5),
                  columns=columns("Chromosome Start End Strand".split(), dtype=int),
                  rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))
Produces dfs like:
Chromosome Start End Strand
0 chr11 1411202 8025685 +
1 chr18 902289 5026205 -
2 chr12 5343877 9282475 +
3 chr16 2279196 8294893 -
4 chr14 1365623 6192931 -
5 chr12 4602782 9424442 +
6 chr10 136262 1739408 +
7 chr15 521644 4861939 +
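If an upper bound on the length is wanted as well, range_indexes also accepts a max_size argument; for example (the bound of 1000 here is just an illustration):
dfs = data_frames(index=range_indexes(min_size=1, max_size=1000),
                  columns=columns("Chromosome Start End Strand".split(), dtype=int),
                  rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))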
