Identify similar numbers from several lists - python-3.x

I have 3 lists:
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
I want to calculate the average of the most similar numbers. In the example above, r[0], g[1] and b[1] are very similar (approximately 0.61...). How can I identify this kind of pattern?

Brute force using list comprehensions:
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
rg = [(idx_r, idx_g, r, g) if abs(rr - gg) < 0.001 else None
      for idx_r, rr in enumerate(r)
      for idx_g, gg in enumerate(g)]
rb = [(idx_r, idx_b, r, b) if abs(rr - bb) < 0.001 else None
      for idx_r, rr in enumerate(r)
      for idx_b, bb in enumerate(b)]
gb = [(idx_g, idx_b, g, b) if abs(gg - bb) < 0.001 else None
      for idx_g, gg in enumerate(g)
      for idx_b, bb in enumerate(b)]
print(list(filter(None, rg + rb + gb)))  # list() is needed on Python 3, where filter is lazy
Output:
[(0, 1, [0.611695403733703, 0.833193902333201, 1.09120811998494],
[0.300675698437847, 0.612539072191236, 1.18046695352626]),
(0, 1, [0.611695403733703, 0.833193902333201, 1.09120811998494],
[0.00668849762984564, 0.611946522017357, 1.16778502636141]),
(1, 1, [0.300675698437847, 0.612539072191236, 1.18046695352626],
[0.00668849762984564, 0.611946522017357, 1.16778502636141])]
Each output tuple contains the index in the first list, the index in the second list, and both lists themselves.

You are looking to compute the distance between all pairs of points. The best way to do this is scipy.spatial.distance.cdist:
from scipy.spatial.distance import cdist
import numpy as np
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
arr = np.array([r,g,b])
# need 2d set of points
arr_flat = arr.ravel()[:, np.newaxis]
# computes distance between every point, pairwise
dists = cdist(arr_flat, arr_flat)
# (1,2) is the same as (2,1), so only consider each pair once
# i.e. use the upper triangle
dists = np.triu(dists)
# set 0 values to inf so we don't consider them
dists[dists == 0] = np.inf
# get all pairs that are below this threshold level
thold = 0.01
coords = np.nonzero(dists<thold)
labels = 'rgb'
print(f'Pairs of points closer than {thold}:')
for i, j in zip(*coords):
    print(labels[i//3] + f'[{i%3}]', labels[j//3] + f'[{j%3}]')
>>> Pairs of points closer than 0.01:
r[0] g[1]
r[0] b[1]
g[1] b[1]
# can easily count the number of matching pairs as
np.count_nonzero(dists<thold)
>>> 3
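To get from the identified pairs back to the average the original question asks for, one possible follow-up, continuing directly from the snippet above (a sketch of mine, not part of the answer), is to collect every flattened point that appears in a below-threshold pair and average it:
# indices (into arr_flat) of every point involved in a close pair
matched_idx = sorted(set(np.concatenate(coords)))
matched_vals = arr_flat[matched_idx, 0]
print('Average of the similar numbers:', matched_vals.mean())
# with the data above the matched entries are r[0], g[1] and b[1], giving roughly 0.612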

Related

How to remove repeating and empty or unmarked values on subplot of x-axis

I'm developing a set of graphs to paint some Pandas DataFrame values. For that I'm using various pandas, numpy and matplotlib modules and functions using the following code:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
data = {'Name': ['immoControlCmd', 'BrkTerrMde', 'GlblClkYr', 'HsaStat', 'TesterPhysicalResGWM', 'FapLc','FirstRowBuckleDriver', 'GlblClkDay'],
'Value': [0, 5, 0, 4, 0, 1, 1, 1],
'Id_Par': [0, 0, 3, 3, 3, 3, 0, 0]
}
signals_df = pd.DataFrame(data)
def plot_signals(signals_df):
    # Count signals by par
    signals_df['Count'] = signals_df.groupby('Id_Par').cumcount().add(1).mask(signals_df['Id_Par'].eq(0), 0)
    # Subtract Par values from the index column
    signals_df['Sub'] = signals_df.index - signals_df['Count']
    id_par_prev = signals_df['Id_Par'].unique()
    id_par = np.delete(id_par_prev, 0)
    signals_df['Prev'] = [1 if x in id_par else 0 for x in signals_df['Id_Par']]
    signals_df['Final'] = signals_df['Prev'] + signals_df['Sub']
    # signals_df['Finall'] = signals_df['Final'].unique()
    # print(signals_df['Finall'])
    # Convert and set Subtract to index
    signals_df.set_index('Final', inplace=True)
    # pos_x = len(signals_df.index.unique()) - 1
    # print(pos_x)
    # Get individual names and variables for the chart
    names_list = [name for name in signals_df['Name'].unique()]
    num_names_list = len(names_list)
    num_axis_x = len(signals_df["Name"])
    # Creation Graphics
    fig, ax = plt.subplots(nrows=num_names_list, figsize=(10, 10), sharex=True)
    plt.xticks(np.arange(0, num_axis_x), color='SteelBlue', fontweight='bold')
    for pos, (a_, name) in enumerate(zip(ax, names_list)):
        # Get data
        data = signals_df[signals_df["Name"] == name]["Value"]
        # Get values axis-x and axis-y
        x_ = np.hstack([-1, data.index.values, len(signals_df) - 1])
        # print(data.index.values)
        y_ = np.hstack([0, data.values, data.iloc[-1]])
        # Plotting the data by position
        ax[pos].plot(x_, y_, drawstyle='steps-post', marker='*', markersize=8, color='k', linewidth=2)
        ax[pos].set_ylabel(name, fontsize=8, fontweight='bold', color='SteelBlue', rotation=30, labelpad=35)
        ax[pos].yaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
        ax[pos].yaxis.set_tick_params(labelsize=6)
        ax[pos].grid(alpha=0.4, color='SteelBlue')
    plt.show()

plot_signals(signals_df)
What I want is to remove the points or positions on the x-axis where nothing is painted or marked on the graph, while keeping the values and names as in the image at the end. Seen from Pandas, it is the "Final" column (which I assign as the index before painting the subplots) where some of the values are repeated; the goal is to remove the values enclosed in the red box from the graph, but leave the values and names as in the image at the end:
                       Name  Value  Id_Par  Count  Sub  Prev
Final
0            immoControlCmd      0       0      0    0     0
1                BrkTerrMde      5       0      0    1     0
2                 GlblClkYr      0       3      1    1     1
2                   HsaStat      4       3      2    1     1
2      TesterPhysicalResGWM      0       3      3    1     1
2                     FapLc      1       3      4    1     1
6      FirstRowBuckleDriver      1       0      0    6     0
7                GlblClkDay      1       0      0    7     0
I've been trying to take the unique values of the last column, which would be the values the x-axis should have, but since the result has a different size than the dataframe, I get an error: ValueError: Length of values (5) does not match length of index (8). I would then have to resize my chart, but in this case I don't understand how to do it:
signals_df['Final'] = signals_df['Prev'] + signals_df['Sub']
signals_df['Finall'] = signals_df['Final'].unique()
print(signals_df['Finall'])
I've also tried taking the size of the unique index (assigned previously) and subtracting it from data.index.values in the variable x_, but it does not give me what I want because it subtracts the same amount from all the values in bulk rather than handling each one separately, as data.index.values does:
signals_df.set_index('Final', inplace=True)
pos_x = len(signals_df.index.unique()) - 1
...
..
.
x_ = np.hstack([-1, data.index.values - pos_x, len(signals_df) - 1])
Is there a Pandas and/or Matplotlib function that allows me to do this? Or could someone give me a suggestion to help me better understand how to do it? What I expect to achieve is the plot below:
I really appreciate your help, any comments help.
I'm using Python 3.6.5, Pandas 1.1.5 and Matplotlib 3.3.2.
One possible way to do this is to make your x-axis values into strings, which means that matplotlib will make a "categorical" plot. See examples of that here.
For your case, because you have subplots which would have different values, and they are not always in the right order, we need to do a bit of trickery first to make sure the ticks appear in the correct order. For that, we can use the approach from this answer, where they plot something that uses all of the x values in the correct order, and then remove it.
To gather all the xtick values together, you can do something like this, where you create a list of the values, reduce it to the unique values using a set, then sort those values, and convert to strings using a list comprehension and str():
# First make a list of all the xticks we want
xvals = [-1,]
for name in names_list:
    xvals.append(signals_df[signals_df["Name"] == name]["Value"].index.values[0])
xvals.append(len(signals_df) - 1)
# Reduce to only unique values, sorted, and then convert to strings
xvals = [str(i) for i in sorted(set(xvals))]
Once you have those, you can make a dummy plot, and then remove it, like so (this is to fix the tick positions in the correct order). NOTE that this needs to be inside your plotting loop for matplotlib versions 3.3.4 and earlier:
# To get the ticks in the right order on all subplots, we need to make
# a dummy plot here and then remove it
dummy, = ax[0].plot(xvals, np.zeros_like(xvals))
dummy.remove()
Finally, when you actually plot the real data inside the loop, you just need to convert x_ to strings as you plot them:
ax[pos].plot(x_.astype('str'), y_, drawstyle='steps-post', marker='*', markersize=8, color='k', linewidth=2)
Note the only other change I made was to not explicitly set the xtick positions (which you did, with plt.xticks), but you can still use that command to set the font colour and weight
plt.xticks(color='SteelBlue', fontweight='bold')
And this is the output:
For completeness, here I have put it all together in your script:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import matplotlib
print(matplotlib.__version__)
data = {'Name': ['immoControlCmd', 'BrkTerrMde', 'GlblClkYr', 'HsaStat', 'TesterPhysicalResGWM', 'FapLc',
'FirstRowBuckleDriver', 'GlblClkDay'],
'Value': [0, 5, 0, 4, 0, 1, 1, 1],
'Id_Par': [0, 0, 3, 3, 3, 3, 0, 0]
}
signals_df = pd.DataFrame(data)
def plot_signals(signals_df):
    # Count signals by par
    signals_df['Count'] = signals_df.groupby('Id_Par').cumcount().add(1).mask(signals_df['Id_Par'].eq(0), 0)
    # Subtract Par values from the index column
    signals_df['Sub'] = signals_df.index - signals_df['Count']
    id_par_prev = signals_df['Id_Par'].unique()
    id_par = np.delete(id_par_prev, 0)
    signals_df['Prev'] = [1 if x in id_par else 0 for x in signals_df['Id_Par']]
    signals_df['Final'] = signals_df['Prev'] + signals_df['Sub']
    # signals_df['Finall'] = signals_df['Final'].unique()
    # print(signals_df['Finall'])
    # Convert and set Subtract to index
    signals_df.set_index('Final', inplace=True)
    # pos_x = len(signals_df.index.unique()) - 1
    # print(pos_x)
    # Get individual names and variables for the chart
    names_list = [name for name in signals_df['Name'].unique()]
    num_names_list = len(names_list)
    num_axis_x = len(signals_df["Name"])
    # Creation Graphics
    fig, ax = plt.subplots(nrows=num_names_list, figsize=(10, 10), sharex=True)
    # No longer any need to define where the ticks go, but still set the colour and weight here
    plt.xticks(color='SteelBlue', fontweight='bold')
    # First make a list of all the xticks we want
    xvals = [-1, ]
    for name in names_list:
        xvals.append(signals_df[signals_df["Name"] == name]["Value"].index.values[0])
    xvals.append(len(signals_df) - 1)
    # Reduce to only unique values, sorted, and then convert to strings
    xvals = [str(i) for i in sorted(set(xvals))]
    for pos, (a_, name) in enumerate(zip(ax, names_list)):
        # To get the ticks in the right order on all subplots,
        # we need to make a dummy plot here and then remove it
        dummy, = ax[pos].plot(xvals, np.zeros_like(xvals))
        dummy.remove()
        # Get data
        data = signals_df[signals_df["Name"] == name]["Value"]
        # Get values axis-x and axis-y
        x_ = np.hstack([-1, data.index.values, len(signals_df) - 1])
        y_ = np.hstack([0, data.values, data.iloc[-1]])
        # Plotting the data by position
        # NOTE: here we convert x_ to strings as we plot, to make sure they are plotted as categorical values
        ax[pos].plot(x_.astype('str'), y_, drawstyle='steps-post', marker='*', markersize=8, color='k', linewidth=2)
        ax[pos].set_ylabel(name, fontsize=8, fontweight='bold', color='SteelBlue', rotation=30, labelpad=35)
        ax[pos].yaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
        ax[pos].yaxis.set_tick_params(labelsize=6)
        ax[pos].grid(alpha=0.4, color='SteelBlue')
    plt.show()

plot_signals(signals_df)

How do I set the minimum and maximum length of dataframes in hypothesis?

I have the following strategy for creating dataframes with genomics data:
from hypothesis.extra.pandas import columns, data_frames, column
import hypothesis.strategies as st
def mysort(tp):
    key = [-1, tp[1], tp[2], int(1e10)]
    return [x for _, x in sorted(zip(key, tp))]
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
chromosomes = st.sampled_from(elements=["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])
genomics_data = data_frames(columns=columns(["Chromosome", "Start", "End", "Strand"], dtype=int),
rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))
I am not really interested in empty dataframes as they are invalid, and I would also like to produce some really long dfs. How do I change the sizes of the dataframes created for test cases? I.e. min size 1, avg size large?
You can give the data_frames constructor an index argument which has min_size and max_size options:
from hypothesis.extra.pandas import data_frames, columns, range_indexes
import hypothesis.strategies as st
def mysort(tp):
    key = [-1, tp[1], tp[2], int(1e10)]
    return [x for _, x in sorted(zip(key, tp))]
chromosomes = st.sampled_from(["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
dfs = data_frames(
    index=range_indexes(min_size=5),
    columns=columns("Chromosome Start End Strand".split(), dtype=int),
    rows=st.tuples(chromosomes, positions, positions, strands).map(mysort),
)
Produces dfs like:
  Chromosome    Start      End Strand
0      chr11  1411202  8025685      +
1      chr18   902289  5026205      -
2      chr12  5343877  9282475      +
3      chr16  2279196  8294893      -
4      chr14  1365623  6192931      -
5      chr12  4602782  9424442      +
6      chr10   136262  1739408      +
7      chr15   521644  4861939      +
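For completeness, range_indexes also accepts max_size if you want to bound the length, and the resulting strategy plugs straight into a test via hypothesis's @given decorator. A minimal sketch continuing from the snippet above (the test body and the max_size value are illustrative assumptions, not from the answer):
from hypothesis import given

dfs_bounded = data_frames(
    index=range_indexes(min_size=1, max_size=300),
    columns=columns("Chromosome Start End Strand".split(), dtype=int),
    rows=st.tuples(chromosomes, positions, positions, strands).map(mysort),
)

@given(df=dfs_bounded)
def test_genomics_df_is_never_empty(df):
    # hypothetical property: every generated frame has at least one row
    assert len(df) >= 1

test_genomics_df_is_never_empty()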

Python Matrix Multiplication - Append into empty list

How do I generate random matrices and multiply them in an efficient way?
This is what I've done:
mat1 = []
for i in range(0, order):
    num1 = random.sample(range(1, 10), order)
    print(num1)
    mat1.append(num1)
print()
print("Result of Matrix Multiplication.")
for p in range(len(mat1)):
    for q in range(len(mat2[0])):
        for r in range(len(mat2)):
            res_matrix[p][q] += mat1[p][r] * mat2[r][q]
for res in res_matrix:
    print(res)
You can use a list comprehension to generate res_matrix:
res_matrix = [[0 for i in range(order)] for j in range(order)]
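Putting this together with the loops from your question, a minimal pure-Python sketch (my own illustration; it assumes random is imported and order is at most 9, since random.sample draws without replacement from range(1, 10)):
import random

order = 3  # assumed size for the illustration

# generate two random matrices with entries from 1 to 9
mat1 = [random.sample(range(1, 10), order) for _ in range(order)]
mat2 = [random.sample(range(1, 10), order) for _ in range(order)]

# result matrix initialised to zeros via the list comprehension above
res_matrix = [[0 for i in range(order)] for j in range(order)]

for p in range(len(mat1)):
    for q in range(len(mat2[0])):
        for r in range(len(mat2)):
            res_matrix[p][q] += mat1[p][r] * mat2[r][q]

print("Result of Matrix Multiplication.")
for res in res_matrix:
    print(res)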
Also, have you heard of numpy? It does this kind of computation (and many more) in an easy and very fast way. This is what your code would become with numpy:
import numpy as np
print("Generate 1st Matrix")
mat1 = np.random.randint(1, 10, size=(order, order))
print(mat1)
print("Generate 2nd Matrix")
mat2 = np.random.randint(1, 10, size=(order, order))
print(mat2)
res_matrix = mat1.dot(mat2)
print("Result of Matrix Multiplication.")
print(res_matrix)
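As a side note, on Python 3.5+ (with a reasonably recent numpy) the same product can also be written with the @ operator, which is equivalent to dot for 2-D arrays:
res_matrix = mat1 @ mat2  # same result as mat1.dot(mat2) for 2-D arrays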

How can I compare two lists of numpy vectors?

I have two lists of numpy vectors and wish to determine whether they represent approximately the same points (but possibly in a different order).
I've found methods such as numpy.testing.assert_allclose but it doesn't allow for possibly different orders. I have also found unittest.TestCase.assertCountEqual but that doesn't work with numpy arrays!
What is my best approach?
import unittest
import numpy as np
first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]
np.testing.assert_allclose(first, second, atol=2) # Fails because the orders are different
unittest.TestCase.assertCountEqual(None, first, second) # Fails because numpy comparisons evaluate element-wise; and because it doesn't allow a tolerance
A nice list iteration approach:
In [1047]: res = []
In [1048]: for i in first:
      ...:     for j in second:
      ...:         diff = np.abs(i-j)
      ...:         if np.all(diff<2):
      ...:             res.append((i,j))
In [1049]: res
Out[1049]:
[(array([20, 40]), array([ 20.1, 40.5])),
(array([20, 60]), array([ 19.8, 59.7]))]
Length of res is the number of matches.
Or as list comprehension:
def match(i,j):
    diff = np.abs(i-j)
    return np.all(diff<2)
In [1051]: [(i,j) for i in first for j in second if match(i,j)]
Out[1051]:
[(array([20, 40]), array([ 20.1, 40.5])),
(array([20, 60]), array([ 19.8, 59.7]))]
or with the existing array test:
[(i,j) for i in first for j in second if np.allclose(i,j, atol=2)]
Here you are :) (idea based on Euclidean distance between points in two different Numpy arrays, not within)
import numpy as np
import scipy.spatial
first = [np.array([20 , 60 ]), np.array([ 20, 40])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]
def pointsProximityCheck(firstListOfPoints, secondListOfPoints, distanceTolerance):
    pointIndex = 0
    maxDistance = 0
    lstIndices = []
    for item in scipy.spatial.distance.cdist(firstListOfPoints, secondListOfPoints):
        currMinDist = min(item)
        if currMinDist > maxDistance:
            maxDistance = currMinDist
        if currMinDist < distanceTolerance:
            pass
        else:
            lstIndices.append(pointIndex)
            # print("point with pointIndex [", pointIndex, "] in the first list outside of Tolerance")
        pointIndex += 1
    return (maxDistance, lstIndices)
maxDistance, lstIndicesOfPointsOutOfTolerance = pointsProximityCheck(first, second, distanceTolerance=0.5)
print("maxDistance:", maxDistance, "indicesOfOutOfTolerancePoints", lstIndicesOfPointsOutOfTolerance )
gives the following output with distanceTolerance=0.5:
maxDistance: 0.509901951359 indicesOfOutOfTolerancePoints [1]
but possibly in a different order
This is the key requirement. This problem can be treated as a classic problem in graph theory: finding a perfect matching in an unweighted bipartite graph. The Hungarian algorithm is a classic algorithm for solving it.
Here is my implementation.
import numpy as np
def is_matched(first, second):
    checked = np.empty((len(first),), dtype=bool)
    first_matching = [-1] * len(first)
    second_matching = [-1] * len(second)

    def find(i):
        for j, point in enumerate(second):
            if np.allclose(first[i], point, atol=2):
                if not checked[j]:
                    checked[j] = True
                    if second_matching[j] == -1 or find(second_matching[j]):
                        second_matching[j] = i
                        first_matching[i] = j
                        return True

    def get_max_matching():
        count = 0
        for i in range(len(first)):
            if first_matching[i] == -1:
                checked.fill(False)
                if find(i):
                    count += 1
        return count

    return len(first) == len(second) and get_max_matching() == len(first)
first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]
print(is_matched(first, second))
# True
first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 43.5])]
print(is_matched(first, second))
# False
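As a possible alternative sketch (my own, not from the answer above): scipy already ships an assignment solver, scipy.optimize.linear_sum_assignment, so the same check can be done by building a pairwise Chebyshev-distance matrix (the per-element absolute difference that np.allclose's atol bounds) and asking whether a one-to-one assignment exists with every matched pair inside the tolerance:
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def is_matched_lsa(first, second, atol=2):
    # a perfect matching requires equally long lists
    if len(first) != len(second):
        return False
    # Chebyshev distance = max per-element absolute difference,
    # so "distance <= atol" mirrors np.allclose(..., rtol=0, atol=atol)
    dists = cdist(np.asarray(first), np.asarray(second), metric='chebyshev')
    # 0/1 costs: the solver then minimises the number of out-of-tolerance pairs,
    # so a total cost of 0 means every point can be matched within tolerance
    cost = (dists > atol).astype(int)
    row_ind, col_ind = linear_sum_assignment(cost)
    return int(cost[row_ind, col_ind].sum()) == 0

first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]
print(is_matched_lsa(first, second))  # True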

Missing value imputation in Python

I have two huge vectors item_clusters and beta. The element item_clusters [ i ] is the cluster id to which the item i belongs. The element beta [ i ] is a score given to the item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted the item_clusters to a matrix clusters_to_items such that the element clusters_to_items [ i ][ j ] = 1 if the cluster i contains item j, else 0. After that I am running the following code.
# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix
alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # The interact_score is the required modified beta
    # This is used to do some work that is very fast
The problem is that this code has to run 150K times and it is very slow. It will take 12 days to run according to my estimate.
Edit: I believe, I need some very different idea in which I can directly use item_clusters, and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
   beta  cluster
0     0      506
1    -1      968
2    -1      873
3     1      179
4     3      269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
      beta  cluster
0  1.27582      506
1 -1.00000      968
2 -1.00000      873
3  1.00000      179
4  3.00000      269
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
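A minimal sketch of what that derivation amounts to (my own illustration, not from the answer; shown with a small dense array for clarity): because alpha is a 0/1 indicator vector for one cluster, the row sum of beta.multiply(alpha) is just the sum of beta over the items in that cluster, which you can read straight from item_clusters without ever forming the 1.3M-wide product:
import numpy as np

# hypothetical small example mirroring the question's shapes
beta_dense = np.array([0, 2, -1, 0, 3, 1])    # scores, 0 means missing
item_clusters = np.array([0, 0, 1, 1, 1, 0])  # cluster id of each item

clust = 0
in_cluster = item_clusters == clust
# equivalent to alpha_beta.sum(1)[0, 0] for this cluster, without the big multiply
sum_row = beta_dense[in_cluster].sum()
num_nonzero = np.count_nonzero(beta_dense[in_cluster])
to_impute = sum_row / num_nonzero
print(sum_row, num_nonzero, to_impute)  # 3 2 1.5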
