I would like to use matplotlib to draw a dendrogram without using scipy. A similar question has been posted here; however, the marked solution suggests using scipy, and the links in the other answers, which suggest using ETE, no longer work. Using this example, I have verified the accuracy of my own method (i.e., not the scipy method) of applying agglomerative hierarchical clustering with the single-linkage criterion.
Using the same linked example, I have the necessary parameters to create my own dendrogram. The original distance_matrix is given by:
.. DISTANCE MATRIX (SHAPE=(6, 6)):
[[  0 662 877 255 412 996]
 [662   0 295 468 268 400]
 [877 295   0 754 564 138]
 [255 468 754   0 219 869]
 [412 268 564 219   0 669]
 [996 400 138 869 669   0]]
A masked array of distance_matrix is used so that the diagonal entries are not counted as minima. The mask of the original distance_matrix is given by:
.. MASKED (BEFORE) DISTANCE MATRIX (SHAPE=(6, 6)):
[[-- 662 877 255 412 996]
 [662 -- 295 468 268 400]
 [877 295 -- 754 564 138]
 [255 468 754 -- 219 869]
 [412 268 564 219 -- 669]
 [996 400 138 869 669 --]]
distance_matrix is changed in-place at every iteration of the algorithm. Once the algorithm has completed, distance_matrix is given by:
.. MASKED (AFTER) DISTANCE MATRIX (SHAPE=(1, 1)):
[[--]]
The levels (the minimum distance at each merger) are given by:
.. 5 LEVELS:
[138, 219, 255, 268, 295]
We can also view the indices of the merged data points at every iteration; these indices refer to the original distance_matrix, since reducing the matrix at each step shifts the index positions. These indices are given by:
.. 5x2 LOCATIONS:
[(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]
From these indices, the ordering of the xticklabels of the dendrogram is given chronologically as:
.. 6 XTICKLABELS:
[2 5 3 4 0 1]
In relation to the linked example,
0 = BA
1 = FI
2 = MI
3 = NA
4 = RM
5 = TO
Using these parameters, I would like to generate a dendrogram that looks like the one below (borrowed from the linked example):
My attempt at trying to replicate this dendrogram using matplotlib is below:
import numpy as np
import matplotlib.pyplot as plt

levels = [138, 219, 255, 268, 295]
locations = [(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]
xticklabels = [2, 5, 3, 4, 0, 1]

fig, ax = plt.subplots()
for loc, level in zip(locations, levels):
    x = np.array(loc)
    y = level * np.ones(x.size)
    ax.step(x, y, where='mid')
ax.set_xticks(xticklabels)
# ax.set_xticklabels(xticklabels)
plt.show()
plt.close(fig)
My attempt above produces the following figure:
I realize I have to reorder the xticklabels so that the first merged points appear at the right edge, with each subsequent merger shifting towards the left; doing so necessarily means adjusting the width of the connecting lines. Also, I used ax.step instead of ax.bar so that the lines would appear more organized (as opposed to rectangular bars everywhere); the only other thing I can think to do is to draw horizontal and vertical lines using ax.axhline and ax.axvline. I am hoping there is a simpler way to accomplish this. Is there a straightforward approach using matplotlib?
While it would certainly be easier to rely on scipy, this is how I'd do it "manually", i.e. without it:
import matplotlib.pyplot as plt
import numpy as np

def mk_fork(x0, x1, y0, y1, new_level):
    points = [[x0, x0, x1, x1], [y0, new_level, new_level, y1]]
    connector = [(x0 + x1) / 2., new_level]
    return (points), connector

levels = [138, 219, 255, 268, 295]
locations = [(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]

label_map = {
    0: {'label': 'BA', 'xpos': 0, 'ypos': 0},
    1: {'label': 'FI', 'xpos': 3, 'ypos': 0},
    2: {'label': 'MI', 'xpos': 4, 'ypos': 0},
    3: {'label': 'NA', 'xpos': 1, 'ypos': 0},
    4: {'label': 'RM', 'xpos': 2, 'ypos': 0},
    5: {'label': 'TO', 'xpos': 5, 'ypos': 0},
}

fig, ax = plt.subplots()

for i, (new_level, (loc0, loc1)) in enumerate(zip(levels, locations)):
    print('step {0}:\t connecting ({1},{2}) at level {3}'.format(i, loc0, loc1, new_level))

    x0, y0 = label_map[loc0]['xpos'], label_map[loc0]['ypos']
    x1, y1 = label_map[loc1]['xpos'], label_map[loc1]['ypos']
    print('\t points are: {0}:({2},{3}) and {1}:({4},{5})'.format(loc0, loc1, x0, y0, x1, y1))

    p, c = mk_fork(x0, x1, y0, y1, new_level)

    ax.plot(*p)
    ax.scatter(*c)

    print('\t connector is at:{0}'.format(c))

    label_map[loc0]['xpos'] = c[0]
    label_map[loc0]['ypos'] = c[1]
    label_map[loc0]['label'] = '{0}/{1}'.format(label_map[loc0]['label'], label_map[loc1]['label'])
    print('\t updating label_map[{0}]:{1}'.format(loc0, label_map[loc0]))

    ax.text(*c, label_map[loc0]['label'])

_xticks = np.arange(0, 6, 1)
_xticklabels = ['BA', 'NA', 'RM', 'FI', 'MI', 'TO']
ax.set_xticks(_xticks)
ax.set_xticklabels(_xticklabels)

ax.set_ylim(0, 1.05 * np.max(levels))

plt.show()
This mostly relies on creating the dictionary label_map, which maps the original "locations" (e.g. (2,5)) to the "xtick order" (e.g. (4,5)). A "fork" is created in each step i using mk_fork(), which returns both the points (which are subsequently connected by a line plot) and the connector point, which is then stored as the new 'xpos','ypos' values within the label_map.
I've added multiple print() statements to emphasize what happens at each step, and added a .text() call to highlight the location of each "connector".
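To make the mechanics concrete, here is what the very first iteration computes (values taken from the levels and locations above):
# Step 0: merge original indices (2, 5), i.e. MI at xpos 4 and TO at xpos 5, at level 138
p, c = mk_fork(4, 5, 0, 0, 138)
# p == [[4, 4, 5, 5], [0, 138, 138, 0]]  -> the U-shaped fork drawn by ax.plot(*p)
# c == [4.5, 138]                        -> stored as the merged cluster's new (xpos, ypos)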
Result:
I'm trying to square a particular axis of a multi-dimensional array without using a loop in Python.
Here I will present the code with a loop.
First, let's define a simple array
x = np.random.randint(low=20, size=(2, 3))
Since the size of the second axis is 3, we have x1, x2, x3. The squared terms of this array are x1^2, x2^2, x3^2, 2*x1*x2, 2*x1*x3, 2*x2*x3. Together with the original x1, x2, x3, we have 9 terms in total.
Here is the full code:
import numpy as np
import time

x = np.random.randint(low=20, size=(2, 3))
print(x)

a, b = x.shape
for i in range(b):
    XiXj = np.einsum('i, ij->ij', x[:, i], x[:, i:b])
    x = np.concatenate((x, XiXj), axis=1)
print(x)
Print:
[[ 3 12 18]
[12 10 10]]
[[ 3 12 18 9 36 54 144 216 324]
[ 12 10 10 144 120 120 100 100 100]]
Of course, this won't take long to compute. However, the array might have a shape like (2000, 5000), and then it will take a while to compute.
How would you do it without the for loop?
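One vectorized possibility (my sketch, not part of the original post) is to build all upper-triangular column-index pairs with np.triu_indices and multiply the corresponding columns in one shot; this reproduces the loop's output order:
import numpy as np

x = np.random.randint(low=20, size=(2, 3))

# all column index pairs (i, j) with i <= j
i, j = np.triu_indices(x.shape[1])

# pairwise products per row: x0*x0, x0*x1, x0*x2, x1*x1, x1*x2, x2*x2
products = x[:, i] * x[:, j]

result = np.concatenate((x, products), axis=1)
print(result)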
I want to do the following:
Fill NaN values in a single column using values within a specific range.
The range I want to use is the mean of the non-NaN values in the column +/- one standard deviation.
NOTE: If possible, I would like to be able to use multiples of the std dev by simply multiplying it by a constant.
I thought I had it (see full code below), but the output from print(df['C'].describe()) shows that I am generating values well outside my desired range. In fact, I am generating numbers outside the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys

print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m')  # Bold red

display_settings = {
    'max_columns': 15,
    'max_colwidth': 60,
    'expand_frame_repr': False,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 6,
    'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)

df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')

# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65)  # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75)  # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')

def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = col.mean(), col.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return col

# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics work. A Gaussian normal distribution has a mean and a std, but values can be drawn far away from mean +/- std; they are just less likely. By definition of a normal distribution, 68% of all values are within +/- 1*std, 95% are within +/- 2*std, and so on. The question is: what do you want to do with outliers? Set them to mean +/- std, or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
import numpy as np
from matplotlib import pyplot as plt

mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000)  # I used a size of 2000 as an example
a[a < (mu - sigma)] = mu - sigma  # clip values below the lower boundary
a[a > (mu + sigma)] = mu + sigma  # clip values above the upper boundary

plt.hist(a, bins=12, edgecolor='black')
plt.show()
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution. It creates a distribution with an upper and a lower boundary. You find this function in the scipy.stats module. It works a bit differently, though: you first create the distribution by normalizing the lower and upper clips, and then you create a number of random variates (rvs) from it, like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
A constant multiple of sigma is easily implemented. You can just change your lower and upper clips like
lower_clip = mu-x*sigma
with x being your constant.
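Tying this back to the question, here is a minimal sketch of how the truncated draw could be slotted into the asker's fill function (the function name and the multiplier k are my assumptions, not from the original post):
import numpy as np
import scipy.stats as stats

def fill_nan_truncnorm(col, k=1.0):
    # Hypothetical helper: fill NaNs with draws from a normal truncated to mean +/- k*std
    a = col.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = col.mean(), col.std()
    # truncnorm takes the clips in units of sigma relative to loc/scale
    dist = stats.truncnorm(-k, k, loc=mu, scale=sigma)
    a[m] = dist.rvs(size=m.sum())
    return col

# usage sketch: fill_nan_truncnorm(df['C'], k=1.0)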
As the question states, I want to apply a two-way adaptive thresholding technique to my image. That is to say, I want to find each pixel value in the neighborhood and set it to 255 if it is less than or greater than the mean of the neighborhood minus a constant c.
Take this image, for example, as the neighborhood of pixels. The desired pixel areas to keep are the darker areas on the third and sixth squares' upper halves (counting from left to right and top to bottom), as well as the eighth and twelfth squares' upper halves.
Obviously, this all depends on the chosen constant value, but ideally areas that are significantly different from the mean pixel value of the neighborhood will be kept. I can worry about the tuning myself, though.
Your question and comment are contradictory: keep everything (significantly) brighter/darker than the mean (+/- constant) of the neighbourhood (question) vs. keep everything within mean +/- constant (comment). I assume the first one to be the correct one, and I'll try to give an answer.
Using cv2.adaptiveThreshold is certainly useful; parameterization might be tricky, especially given the example image. First, let's have a look at the output:
We see that the intensity value range in the given image is small. The upper halves of the third and sixth squares don't really differ from their neighbourhood, so it's quite unlikely to find a proper difference there. The upper halves of squares #8 and #12 (and also the lower half of square #10) are more likely to be found.
The top row now shows some more "global" parameters (blocksize = 151, c = 25), the bottom row more "local" parameters (blocksize = 51, c = 5). The middle column is everything darker than the neighbourhood (with respect to the parameters), the right column is everything brighter than the neighbourhood. We see that in the more "global" case, we get the proper upper halves, but there are mostly no "significant" darker areas. Looking at the more "local" case, we see some darker areas, but we won't find the complete upper/lower halves in question. That's just because of how the different triangles are arranged.
On the technical side: You need two calls of cv2.adaptiveThreshold, one using the cv2.THRESH_BINARY_INV mode to find everything darker and one using the cv2.THRESH_BINARY mode to find everything brighter. Also, you have to provide c or -c for the two different cases.
Here's the full code:
import cv2
from matplotlib import pyplot as plt
from skimage import io # Only needed for web grabbing images
plt.figure(1, figsize=(15, 10))
img = cv2.cvtColor(io.imread('https://i.stack.imgur.com/dA1Vt.png'), cv2.COLOR_RGB2GRAY)
plt.subplot(2, 3, 1), plt.imshow(img, cmap='gray'), plt.colorbar()
# More "global" parameters
bs = 151
c = 25
img_le = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, bs, c)
img_gt = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, bs, -c)
plt.subplot(2, 3, 2), plt.imshow(img_le, cmap='gray')
plt.subplot(2, 3, 3), plt.imshow(img_gt, cmap='gray')
# More "local" parameters
bs = 51
c = 5
img_le = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, bs, c)
img_gt = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, bs, -c)
plt.subplot(2, 3, 5), plt.imshow(img_le, cmap='gray')
plt.subplot(2, 3, 6), plt.imshow(img_gt, cmap='gray')
plt.tight_layout()
plt.show()
Hope that helps – somehow!
-----------------------
System information
-----------------------
Python: 3.8.1
Matplotlib: 3.2.0rc1
OpenCV: 4.1.2
-----------------------
Another way to look at this is: where abs(mean - image) <= c, you want that to become white; otherwise, you want that to become black. In Python/OpenCV/Scipy/Numpy, I first compute the local uniform mean (average) using a uniform 51x51 pixel block averaging filter (boxcar average). You could use some weighted averaging method such as a Gaussian average, if you want. Then I compute abs(mean - image). Then I use Numpy thresholding. Note: you could also just use one simple threshold (cv2.threshold) on the abs(mean - image) result in place of the two Numpy thresholds.
Input:
import cv2
import numpy as np
from scipy import ndimage

# read image as grayscale
# convert to floats in the range 0 to 1 so that the difference keeps negative values
img = cv2.imread('squares.png', 0).astype(np.float32) / 255.0

# get uniform (51x51 block) average
ave = ndimage.uniform_filter(img, size=51)

# get abs difference between ave and img and convert back to integers in the range 0 to 255
diff = 255 * np.abs(ave - img)
diff = diff.astype(np.uint8)

# threshold
# Note: could also just use one simple cv2.threshold on diff
c = 5
diff_thresh = diff.copy()
diff_thresh[diff_thresh <= c] = 255
diff_thresh[diff_thresh != 255] = 0

# view result
cv2.imshow("img", img)
cv2.imshow("ave", ave)
cv2.imshow("diff", diff)
cv2.imshow("threshold", diff_thresh)
cv2.waitKey(0)
cv2.destroyAllWindows()

# save result
cv2.imwrite("squares_2way_thresh.jpg", diff_thresh)
Result:
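For completeness, the single-call alternative mentioned in the note above might look like this (a sketch reusing diff and c from the code above; cv2.THRESH_BINARY_INV maps diff <= c to 255 and everything else to 0):
# single-threshold variant of the two Numpy assignments above
_, diff_thresh = cv2.threshold(diff, c, 255, cv2.THRESH_BINARY_INV)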
This is my dataset:
4095 546
3213 2059
4897 2661
...
3586 2583
3437 3317
3364 1216
Each line is a pair of nodes which have an edge between them. The whole dataset builds a graph, but I want to get many node pairs which are disconnected from each other. How can I get 1000 (or more) such node pairs from the dataset? For example:
2761 2788
4777 3365
3631 3553
...
3717 4074
3013 2225
Each line is a pair of nodes without edge.
Please see the part under the EDIT!
I think the other options are more general, and probably nicer from a programming point of view. I just had a quick idea of how you could get the list in a very easy way using numpy.
First, create the adjacency matrix, assuming your list of node pairs is an array:
import numpy as np

node_list = np.random.randint(10, size=(10, 2))
A = np.zeros((np.max(node_list) + 1, np.max(node_list) + 1))  # + 1 to account for zero indexing
A[node_list[:, 0], node_list[:, 1]] = 1  # set connected nodes to 1
x, y = np.where(A == 0)  # Find disconnected nodes
disconnected_list = np.vstack([x, y]).T  # The final list of disconnected nodes
I have no idea, though, how this will work with really large-scale networks.
EDIT: The above solution was me thinking a bit too fast. As it stands, the solution above provides the missing edges between nodes, not the disconnected node pairs (in the case of a directed graph). Furthermore, disconnected_list includes each node pair twice. Here is a hacky second idea for a solution:
import numpy as np

node_list = np.random.randint(10, size=(10, 2))
A = np.zeros((np.max(node_list) + 1, np.max(node_list) + 1))  # + 1 to account for zero indexing
A[node_list[:, 0], node_list[:, 1]] = 1  # set connected nodes to 1
A[node_list[:, 1], node_list[:, 0]] = 1  # Make the graph symmetric
A = A + np.triu(np.ones(A.shape))  # Add ones to the upper triangular matrix,
# so they are not considered in np.where (set k if you want to consider the diagonal)
x, y = np.where(A == 0)  # Find disconnected nodes
disconnected_list = np.vstack([x, y]).T  # The final list of disconnected nodes
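To then pull, say, 1000 of those pairs at random, a short usage sketch (my addition, not part of the original answer):
# randomly sample up to 1000 disconnected pairs without replacement
n = min(1000, len(disconnected_list))
idx = np.random.choice(len(disconnected_list), size=n, replace=False)
sample_pairs = disconnected_list[idx]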
Just do a BFS or DFS to get the size of every connected component in O(|E|) time. Once you have the component sizes, you can get the number of disconnected pairs easily: it's the sum of the products of every pair of sizes.
E.g. if your graph has 3 connected components with sizes 50, 20, 100, then the number of pairs of disconnected nodes is: 50*20 + 50*100 + 20*100 = 8000.
If you want to actually output the disconnected pairs instead of just counting them, you should probably use union-find and then just iterate through all pairs of nodes and output them if they're not in the same component.
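A minimal sketch of that union-find approach (all names here are mine, not from the original answer):
import itertools

def find(parent, x):
    # find the root of x, with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def disconnected_pairs(edges, limit=1000):
    nodes = sorted({n for edge in edges for n in edge})
    parent = {n: n for n in nodes}
    for u, v in edges:  # union the endpoints of every edge
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
    pairs = []
    for u, v in itertools.combinations(nodes, 2):
        if find(parent, u) != find(parent, v):  # different components -> disconnected
            pairs.append((u, v))
            if len(pairs) >= limit:
                break
    return pairs

# example with two tiny components:
# disconnected_pairs([(1, 2), (3, 4)], limit=10) -> [(1, 3), (1, 4), (2, 3), (2, 4)]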
I have the following data, separated by tabs:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
1 1,2 60,6 2820,81 2 66
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
3 5,3,4,6 6,12,14,17 908,394,759,115656 4 49
4 17,18,22,16,19,21,20 22,11,3,16,7,12,6 1463,171,149,256,157,388,195 7 77
5 13,15,12,14 56,25,96,107 2600821,858,5666,1792 4 284
7 24,26,29,25,27,23,30,28,31 12,31,19,6,12,23,9,37,25 968,3353,489,116,523,1933,823,2655,331 9 174
8 33,32 53,35 1603,2991338 2 88
I am using this code to build a histogram plots with subplots for each CHROM:
with open(outputdir + '/' + 'hap_size_byVar_' + soi + '_' + prefix + '.png', 'wb') as fig_initial:
    fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
    for i, data in hap_stats.iterrows():
        # first convert data to list of integers
        data_i = [int(x) for x in data['num_Vars_by_PI'].split(',')]
        ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
        ax[i].legend()
    plt.xlabel('size of the haplotype (number of variants)')
    plt.ylabel('frequency of the haplotypes')
    plt.suptitle('histogram of size of the haplotype (number of variants) \n'
                 'for each chromosome')
    plt.savefig(fig_initial)
Everything is fine except two problems:
The y-label 'frequency of the haplotypes' is not positioned properly in the output plot.
When the data contain only one row (see the data below), the subplots are not possible and I get a TypeError, even though it should be possible to make the subplot with only one index.
Dataframe with only one line of data:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
TypeError:
Traceback (most recent call last):
File "phase-Extender.py", line 1806, in <module>
main()
File "phase-Extender.py", line 502, in main
compute_haplotype_stats(initial_haplotype, soi, prefix='initial')
File "phase-Extender.py", line 1719, in compute_haplotype_stats
ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
TypeError: 'AxesSubplot' object does not support indexing
How can I fix these two issues?
Your first problem comes from the fact that you are using plt.ylabel() at the end of your loop. pyplot functions act on the current active axes object, which, in this case, is the last one created by subplots(). If you want your label to be centered over your subplots, the easiest might be to create a text object centered vertically in the figure.
# replace plt.ylabel('frequency of the haplotypes') with:
fig.text(.02, .5, 'frequency of the haplotypes', ha='center', va='center', rotation='vertical')
You can play around with the x-position (0.02) until you find a position you're happy with. The coordinates are in figure coordinates: (0,0) is the bottom left, (1,1) is the top right. Using 0.5 as the y-position ensures the label is centered in the figure.
The second problem is due to the fact that, when nrows=1, plt.subplots() returns the axes object directly, instead of an array of axes. There are two options to circumvent this problem:
1 - test whether you have only one line, and then replace ax with a list:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
if len(hap_stats) == 1:
    ax = [ax]
(...)
2 - use the option squeeze=False in your call to plt.subplots(). As explained in the documentation, using this option will force subplots() to always return a 2D array of axes. Therefore, you'll have to modify a bit how you index your axes:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)
for i, data in hap_stats.iterrows():
    (...)
    ax[i, 0].hist(data_i, label=str(data['CHROM']), alpha=0.5)
    (...)