Finding the minimum value in clusters of an array using Python - python-3.x

I need some help with some Python code.
Here is the problem.
Let's say I have an array (size = (50, 50)) containing float numbers. I would like to find the minimum value for every cluster of cells (size = (10, 10)), so in total I will have 25 values.
This is what I did so far. Maybe there is another way to do it so that the program runs faster, since I need it to handle a quite big array (say, 1 million x 1 million cells).
import numpy as np
import random

def mini_cluster(z, y, x):
    a = []
    for i in range(y, y + 10):
        for j in range(x, x + 10):
            a.append(z[i, j])
    return min(a)

z = np.zeros(shape=(50, 50))
for i in range(len(z)):
    for j in range(len(z)):
        z[i, j] = random.uniform(10, 12.5)

mini = []
for i in range(0, len(z), 10):
    for j in range(0, len(z), 10):
        mini.append(mini_cluster(z, i, j))

I am not sure about its speed, but using numpy slicing should simplify your work; you can avoid all those for loops. Here is some sample code:
import numpy as np

arr = [[1, 2, 3, 8], [4, 5, 6, 7], [8, 9, 10, 11], [0, 3, 5, 9]]
arr_np = np.array(arr)
print(arr_np)
cluster = arr_np[:3, :3]
print('\n')
print(cluster)
print('\n')
print(np.amin(cluster))
[[ 1  2  3  8]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  3  5  9]]
[[ 1  2  3]
 [ 4  5  6]
 [ 8  9 10]]
1
You can also check this tutorial.
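For the asker's actual use case (equal-sized tiles over a large array), a reshape-based reduction removes the Python loops entirely. A minimal sketch, assuming the array dimensions are exact multiples of the block size:

import numpy as np

z = np.random.uniform(10, 12.5, size=(50, 50))  # stand-in for the question's array
b = 10  # block size; assumes both dimensions divide evenly by b
# view the array as a (5, 10, 5, 10) grid of tiles, then reduce each tile
mini = z.reshape(z.shape[0] // b, b, z.shape[1] // b, b).min(axis=(1, 3))
print(mini.shape)  # (5, 5) -> the 25 block minima

Since this stays inside numpy's compiled loops, it should scale far better than per-cell Python iteration.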

Related

Vectors in python

Hi, I have just started using Python and coding in general. This is the last question of my assignment and I honestly have no clue as to how to even start it.
I need to write a program to do basic vector calculations in 3 dimensions:
addition, dot product and normalization.
I have no clue what to do after this step, or whether this step is even right. Please help.
The expected result is:
Enter vector A:
1 3 2
Enter vector B:
2 3 0
A+B = [3, 6, 2]
A.B = 11
|A| = 3.74
|B| = 3.61
Using numpy:
import numpy as np

A = np.array([1, 3, 2])
B = np.array([2, 3, 0])
# sum
print(A + B)  # -> array([3, 6, 2])
# dot product
print(np.dot(A, B))  # -> 11
# normalization
print(np.linalg.norm(A))  # -> 3.741...
print(np.linalg.norm(B))  # -> 3.605...
Without numpy:
A = [1, 3, 2]
B = [2, 3, 0]
# sum
print([i + j for i, j in zip(A, B)])
# dot product
print(sum(i * j for i, j in zip(A, B)))
# normalization
print(sum(i**2 for i in A) ** 0.5)
print(sum(i**2 for i in B) ** 0.5)
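To reproduce the exact interaction shown in the expected result, either snippet above just needs input parsing around it. A minimal sketch; the whitespace-separated input format and the rounding to 2 decimal places are my assumptions:

A = [int(x) for x in input("Enter vector A:\n").split()]
B = [int(x) for x in input("Enter vector B:\n").split()]
print("A+B =", [i + j for i, j in zip(A, B)])
print("A.B =", sum(i * j for i, j in zip(A, B)))
print("|A| =", round(sum(i**2 for i in A) ** 0.5, 2))
print("|B| =", round(sum(i**2 for i in B) ** 0.5, 2))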

How to return floating values using floor division

In Python 3, I want to return the units place of an integer value, then the tens, then the hundreds, and so on. Suppose I have the integer 456; first I want to return 6, then 5, then 4. Is there any way? I tried floor division and a for loop, but it didn't work.
If you look at the list of basic operators in the documentation, for example here:

Operator  Description                                                      Example
%         Modulus: divides the left-hand operand by the right-hand         b % a = 1
          operand and returns the remainder.
//        Floor division: the quotient with the digits after the           9//2 = 4, 9.0//2.0 = 4.0,
          decimal point removed. If one of the operands is negative,       -11//3 = -4, -11.0//3 = -4.0
          the result is floored, i.e. rounded towards negative infinity.
With that knowledge, you can get what you want as follows:
In [1]: a = 456
In [2]: a % 10
Out[2]: 6
In [3]: (a % 100) // 10
Out[3]: 5
In [4]: a // 100
Out[4]: 4
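The same two operators generalize to any number of digits. A small sketch using divmod, which returns the floor-division quotient and the remainder in one call (the helper name is mine):

def digits(n):
    """Return the digits of a non-negative integer, units place first."""
    out = []
    while True:
        n, d = divmod(n, 10)  # quotient and remainder together
        out.append(d)
        if n == 0:
            break
    return out

print(digits(456))  # [6, 5, 4]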
Write a generator if you want to retrieve the digits at different places in your code, based on your requirements, as follows.
If you are not much familiar with Python's generator, have a quick look at https://www.programiz.com/python-programming/generator.
» Here get_digits() is a generator.
def get_digits(n):
    while True:
        yield n % 10
        n = n // 10
        if not n:
            break
digit = get_digits(1729)
print(next(digit)) # 9
print(next(digit)) # 2
print(next(digit)) # 7
print(next(digit)) # 1
» If you wish to iterate over digits, you can also do so as follows.
for digit in get_digits(74831965):
    print(digit)
# 5
# 6
# 9
# 1
# 3
# 8
# 4
# 7
» Quick overview of its usage (on Python 3's interactive terminal).
>>> def letter(name):
...     for ch in name:
...         yield ch
...
>>>
>>> char = letter("RISHIKESH")
>>>
>>> next(char)
'R'
>>>
>>> "Second letter is my name is: " + next(char)
'Second letter is my name is: I'
>>>
>>> "3rd one: " + next(char)
'3rd one: S'
>>>
>>> next(char)
'H'
>>>

creating lists from row data

My input data has the following format
id offset code
1 3 21
1 3 24
1 5 21
2 1 84
3 5 57
3 5 21
3 5 92
3 10 83
3 10 21
I would like the output in the following format
id  offset   code
1   [3,5]    [[21,24],[21]]
2   [1]      [[84]]
3   [5,10]   [[21,57,92],[21,83]]
The code that I have been able to come up with is shown below
import random, pandas

random.seed(10000)
param = dict(nrow=100, nid=10, noffset=8, ncode=100)
#param = dict(nrow=1000, nid=10, noffset=8, ncode=100)
#param = dict(nrow=100000, nid=1000, noffset=50, ncode=5000)
#param = dict(nrow=10000000, nid=10000, noffset=100, ncode=5000)
pd = pandas.DataFrame({
    "id": random.choices(range(1, param["nid"] + 1), k=param["nrow"]),
    "offset": random.choices(range(param["noffset"]), k=param["nrow"]),
})
pd["code"] = random.choices(range(param["ncode"]), k=param["nrow"])
pd = pd.sort_values(["id", "offset", "code"]).reset_index(drop=True)
tmp1 = pd.groupby(by=["id"])["offset"].apply(lambda x: list(set(x))).reset_index()
tmp2 = pd.groupby(by=["id", "offset"])["code"].apply(lambda x: list(x)).reset_index().groupby(
    by=["id"], sort=True)["code"].apply(lambda x: list(x)).reset_index()
out = pandas.merge(tmp1, tmp2, on="id", sort=False)
It does give me the output that I want, but it is VERY slow when the dataframe is large; the dataframe that I have has over 40 million rows. In the example, uncomment the fourth param statement and you will see how slow it is.
Can you please help with making this run faster?
Two chained groupby calls do it (df here is the question's DataFrame, which the question names pd; a different name avoids clashing with the conventional pandas alias):

(df.groupby(['id', 'offset']).code.apply(list).reset_index()
   .groupby('id').agg(lambda x: x.tolist()))
Out[733]:
    offset                      code
id
1   [3, 5]          [[21, 24], [21]]
2      [1]                    [[84]]
3  [5, 10]  [[57, 21, 92], [83, 21]]
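For reference, here is a self-contained run of that approach on the question's sample rows (the frame construction below is mine, entered by hand):

import pandas as pd

df = pd.DataFrame({
    "id":     [1, 1, 1, 2, 3, 3, 3, 3, 3],
    "offset": [3, 3, 5, 1, 5, 5, 5, 10, 10],
    "code":   [21, 24, 21, 84, 57, 21, 92, 83, 21],
})

out = (df.groupby(['id', 'offset']).code.apply(list).reset_index()
         .groupby('id').agg(lambda x: x.tolist()))
print(out)

This avoids the separate tmp1/tmp2 passes and the final merge in the question's version.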

Python Pandas: bootstrap confidence limits by row rather than entire dataframe

What I am trying to do is to get bootstrap confidence limits by row, regardless of the number of rows, and make a new dataframe from the output. I currently can do this for the entire dataframe, but not by row. The data I have in my actual program looks similar to what I have below:
   0  1  2
0  1  2  3
1  4  1  4
2  1  2  3
3  4  1  4
I want the new dataframe to look something like this with the lower and upper confidence limits:
   0    1
0  1    2
1  1  5.5
2  1  4.5
3  1  4.2
The current generated output looks like this:
     0     1
0  2.0  2.75
The Python 3 code below generates a mock dataframe and the bootstrap confidence limits for the entire dataframe. The result is a new dataframe with just 2 values, an upper and a lower confidence limit, rather than 4 sets of 2 (one for each row).
import pandas as pd
import numpy as np
import scikits.bootstrap as sci

zz = pd.DataFrame([[[1,2],[2,3],[3,6]], [[4,2],[1,4],[4,6]],
                   [[1,2],[2,3],[3,6]], [[4,2],[1,4],[4,6]]])
print(zz)
x = zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0], zz.index, zz.columns)
print(a)
b = sci.ci(a)
b = pd.DataFrame(b)
b = b.T
print(b)
Thank you for any help.
scikits.bootstrap operates by assuming that data samples are arranged by row, not by column. If you want the opposite behavior, just use the transpose, and a statfunction that doesn't combine columns.
import pandas as pd
import numpy as np
import scikits.bootstrap as sci

zz = pd.DataFrame([[[1,2],[2,3],[3,6]], [[4,2],[1,4],[4,6]],
                   [[1,2],[2,3],[3,6]], [[4,2],[1,4],[4,6]]])
print(zz)
x = zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0], zz.index, zz.columns)
print(a)
b = sci.ci(a.T, statfunction=lambda x: np.average(x, axis=0))
print(b.T)
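To get the two-column frame the question describes (one lower/upper pair per row), the transposed result can be labeled directly. A small follow-up sketch; the "lower"/"upper" column names are my choice, not part of scikits.bootstrap:

limits = pd.DataFrame(b.T, index=a.index, columns=["lower", "upper"])
print(limits)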
Below is the answer I ended up figuring out to create bootstrap CIs by row.
import pandas as pd
import numpy as np
import numpy.random as npr

zz = pd.DataFrame([[[1,2],[2,3],[3,6]], [[4,2],[1,4],[4,6]],
                   [[1,2],[2,3],[3,6]], [[4,2],[1,4],[4,6]]])
x = zz.dtypes
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0], zz.index, zz.columns)
print(a)

def bootstrap(data, num_samples, statistic, alpha):
    # draw num_samples bootstrap resamples of the data (with replacement)
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    # sort the resampled statistics and read off the percentile bounds
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha / 2.0) * num_samples)],
            stat[int((1 - alpha / 2.0) * num_samples)])
cc = list(a.index.values)  # informs the generator of the number of rows

def bootbyrow(cc):
    for xx in range(len(cc)):
        k = a.apply(lambda y: y[xx]).values  # row xx, taken across all columns
        yield list(bootstrap(k, 10000, np.mean, 0.05))

abc = pd.DataFrame(list(bootbyrow(cc)))  # bootstrap CI by row
# the next four calls just show that it's working correctly
a0 = bootstrap((a.loc[0,].values),10000,np.mean,0.05)
a1 = bootstrap((a.loc[1,].values),10000,np.mean,0.05)
a2 = bootstrap((a.loc[2,].values),10000,np.mean,0.05)
a3 = bootstrap((a.loc[3,].values),10000,np.mean,0.05)
print(abc)
print(a0)
print(a1)
print(a2)
print(a3)
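Since scikits.bootstrap is already imported in the answer above, the same per-row CIs can also be produced by calling it once per row. A brief alternative sketch, assuming sci.ci's defaults (statfunction np.average, alpha 0.05):

import scikits.bootstrap as sci

rows = [sci.ci(a.loc[i].values) for i in a.index]       # one (lower, upper) pair per row
abc2 = pd.DataFrame(rows, columns=["lower", "upper"])   # column names are my choice
print(abc2)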

Missing value imputation in Python

I have two huge vectors, item_clusters and beta. The element item_clusters[i] is the cluster id to which item i belongs. The element beta[i] is a score given to item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted item_clusters to a matrix clust_to_items such that the element clust_to_items[i][j] = 1 if cluster i contains item j, else 0. After that I run the following code.
# beta (1 x 1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1 x 1.3M) numpy.array
# clust_to_items (1000 x 1.3M) csr_matrix

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # interact_score is the required modified beta
    # it is used to do some work that is very fast
The problem is that this code has to run 150K times and it is very slow; it will take 12 days to run, according to my estimate.
Edit: I believe I need a very different approach, in which I can use item_clusters directly and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
   beta  cluster
0     0      506
1    -1      968
2    -1      873
3     1      179
4     3      269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
      beta  cluster
0  1.27582      506
1 -1.00000      968
2 -1.00000      873
3  1.00000      179
4  3.00000      269
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
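Concretely, since alpha is a 0/1 indicator row, beta.multiply(alpha) merely selects the cluster's entries, so the same quantities can be read straight off the cluster's column indices with no full-width multiply. A sketch of that idea under the question's setup (variable names as in the question's code):

row_idx = clust_to_items[clust, :].indices   # columns where the indicator is 1
row = beta[0, row_idx]                       # just this cluster's scores
sum_row = row.sum()
num_nonzero = row.count_nonzero() + 0.001
to_impute = sum_row / num_nonzero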
