heatmap based on ratios in Python's seaborn - python-3.x

I have data in Cartesian coordinates. Each coordinate also has a binary variable attached. I want to make a heatmap where, in each polygon (hexagon, rectangle, etc.), the color strength is the ratio of the number of occurrences where the boolean is True to the total number of occurrences in that polygon.
The data can for example look like this:
df = pd.DataFrame([[1,2,False],[-1,5,True], [51,52,False]])
I know that seaborn can generate heatmaps via seaborn.heatmap, but the color strength is based by default on the total occurrences in each polygon, not the above ratio. Is there perhaps another plotting tool that would be more suitable?

You could also use the pandas groupby functionality to compute the ratios (the mean of a boolean column is the fraction of True values) and then pass the result to seaborn.heatmap. With the example data borrowed from @ImportanceOfBeingErnest it would look like this:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(0)
x = np.random.poisson(5, size=200)
y = np.random.poisson(7, size=200)
z = np.random.choice([True, False], size=200, p=[0.3, 0.7])
df = pd.DataFrame({"x" : x, "y" : y, "z":z})
res = df.groupby(['y','x'])['z'].mean().unstack()
ax = sns.heatmap(res)
ax.axis('equal')
ax.invert_yaxis()
(Image: the resulting plot)
If your x and y values aren't integers you can cut them into the desired number of categories for grouping:
bins = 10
res = df.groupby([pd.cut(df.y, bins),pd.cut(df.x,bins)])['z'].mean().unstack()
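Since the question mentions hexagonal cells, here is a minimal sketch using matplotlib's hexbin, which is not part of the answer above. Passing the boolean column as C together with reduce_C_function=np.mean should color each hexagon by the fraction of True points inside it. The synthetic data and the gridsize value are assumptions chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: continuous coordinates with a boolean flag per point
rng = np.random.default_rng(0)
x = rng.normal(5, 2, size=500)
y = rng.normal(7, 2, size=500)
z = rng.random(500) < 0.3

fig, ax = plt.subplots()
# C holds the per-point values; the mean of 0/1 values is the True ratio per hexagon
hb = ax.hexbin(x, y, C=z.astype(float), reduce_C_function=np.mean,
               gridsize=10, cmap="Reds", vmin=0, vmax=1)
fig.colorbar(hb, ax=ax, label="fraction True")

ratios = hb.get_array()  # one ratio per non-empty hexagon
```

Hexagons that contain no points are simply not drawn, so there is no division by zero to handle here.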

An option would be to calculate two histograms, one for the complete dataframe and one for the dataframe filtered for the True values. Dividing the latter by the former gives the ratio you're after.
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.poisson(5, size=200)
y = np.random.poisson(7, size=200)
z = np.random.choice([True, False], size=200, p=[0.3, 0.7])
df = pd.DataFrame({"x" : x, "y" : y, "z":z})
dftrue = df[df["z"] == True]
bins = np.arange(0,22)
hist, xbins, ybins = np.histogram2d(df.x, df.y, bins=bins)
histtrue, _ ,__ = np.histogram2d(dftrue.x, dftrue.y, bins=bins)
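# histogram2d puts x along the first array axis, so transpose to get x on the horizontal axis
plt.imshow((histtrue/hist).T, origin="lower", cmap=plt.cm.Reds)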
plt.colorbar()
plt.show()
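Bins that contain no points turn the division above into 0/0, which NumPy reports with a RuntimeWarning and fills with NaN. A minimal sketch of one way to make that explicit, assuming the same kind of data as above, is to divide only where the total count is positive using the where argument of np.divide. The variable names total, true_counts and ratio are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(5, size=200)
y = rng.poisson(7, size=200)
z = rng.random(200) < 0.3  # boolean flag per point

bins = np.arange(0, 22)
total, _, _ = np.histogram2d(x, y, bins=bins)
true_counts, _, _ = np.histogram2d(x[z], y[z], bins=bins)

# Divide only where the bin is non-empty; empty bins keep their NaN fill value
ratio = np.full_like(total, np.nan)
np.divide(true_counts, total, out=ratio, where=total > 0)
```

The resulting ratio array can be passed to plt.imshow in place of histtrue/hist; NaN cells are left blank by imshow.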

Related

Is it possible to switch X axis in Python matplotlib.pyplot.hist from bin edges to exact values?

In other words this is what I get:
dataset = [0,1,1,1,2,2,3,3,4]
plt.hist(dataset, 5, rwidth=0.9)
and this is what I need:
You can first compute the frequencies and then use a bar plot:
from collections import Counter
import matplotlib.pyplot as plt
dataset = [0,1,1,1,2,2,3,3,4]
freqs = Counter(dataset)
plt.bar(list(freqs.keys()), list(freqs.values()), width=0.9)
plt.show()
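If you would rather keep plt.hist, an alternative (not from the answer above) is to put the bin edges at half-integers so that each bar is centered on an exact value. The sketch below only computes the edges and counts with NumPy and assumes the data are integers.

```python
import numpy as np

dataset = [0, 1, 1, 1, 2, 2, 3, 3, 4]

# Edges at -0.5, 0.5, ..., 4.5 so that every integer sits in the middle of its bin
edges = np.arange(min(dataset), max(dataset) + 2) - 0.5
counts, _ = np.histogram(dataset, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2
```

The same edges can be handed to the histogram call, for example plt.hist(dataset, bins=edges, rwidth=0.9), with plt.xticks(centers) to label the exact values.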

Assign edge weights to a networkx graph using pandas dataframe

I am constructing a networkx graph in Python 3. I am using a pandas dataframe to supply the edges and nodes to the graph. Here is what I have done:
test = pd.read_csv("/home/Desktop/test_call1", delimiter = ';')
g_test = nx.from_pandas_edgelist(test, 'number', 'contactNumber', edge_attr='callDuration')
What I want is for the "callDuration" column of the pandas dataframe to act as the weight of the edges of the networkx graph, and for the thickness of the edges to change accordingly.
I also want to get the 'n' maximum weighted edges.
Let's try:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
df = pd.DataFrame({'number':['123','234','345'],'contactnumber':['234','345','123'],'callduration':[1,2,4]})
df
G = nx.from_pandas_edgelist(df,'number','contactnumber', edge_attr='callduration')
# one duration per edge, in the same order as G.edges, used as line widths
durations = [i['callduration'] for i in dict(G.edges).values()]
# map each node to itself so that it is labelled with its own name
labels = {i:i for i in dict(G.nodes).keys()}
fig, ax = plt.subplots(figsize=(12,5))
pos = nx.spring_layout(G)
# draw_networkx_nodes takes no labels argument; the labels are drawn separately below
nx.draw_networkx_nodes(G, pos, ax=ax)
nx.draw_networkx_edges(G, pos, width=durations, ax=ax)
_ = nx.draw_networkx_labels(G, pos, labels, ax=ax)
Output: (image not included)
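The question also asks for the 'n' maximum weighted edges, which the code above does not cover. A minimal sketch, assuming the same small example dataframe and an arbitrary n = 2, is to iterate over the edges together with their duration and keep the largest ones with heapq.nlargest.

```python
import heapq
import networkx as nx
import pandas as pd

df = pd.DataFrame({'number': ['123', '234', '345'],
                   'contactnumber': ['234', '345', '123'],
                   'callduration': [1, 2, 4]})
G = nx.from_pandas_edgelist(df, 'number', 'contactnumber', edge_attr='callduration')

n = 2  # hypothetical choice of how many edges to keep
# G.edges(data='callduration') yields (u, v, duration) triples
top_edges = heapq.nlargest(n, G.edges(data='callduration'), key=lambda e: e[2])
top_durations = [d for _, _, d in top_edges]
```

For a small graph, sorted(..., reverse=True)[:n] would do the same job; nlargest just avoids sorting every edge.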
I do not agree with what has been said. When computing metrics that take edge weights into account, such as PageRank or betweenness centrality, your weight will not be taken into account if it is stored under an edge attribute with another name, unless you pass that name through the function's weight parameter.
Use graph.add_edge(source, target, weight=weight, **attrs) so that the value is stored under the 'weight' key.
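To illustrate the point about attribute names, here is a small sketch under the assumption that the durations live in a dataframe column as in the earlier answer. Renaming the column to 'weight' before building the graph stores the value under the 'weight' key; alternatively, the original name can be passed to any function that has a weight parameter. Weighted degree is used here only because it is easy to check by hand.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({'number': ['123', '234', '345'],
                   'contactnumber': ['234', '345', '123'],
                   'callduration': [1, 2, 4]})

# Option 1: rename the column so the edge attribute is literally called 'weight'
edges = df.rename(columns={'callduration': 'weight'})
G = nx.from_pandas_edgelist(edges, 'number', 'contactnumber', edge_attr='weight')
weighted_degree = dict(G.degree(weight='weight'))

# Option 2: keep the original name and pass it explicitly
G2 = nx.from_pandas_edgelist(df, 'number', 'contactnumber', edge_attr='callduration')
same_degree = dict(G2.degree(weight='callduration'))
```

Both dictionaries hold the sum of call durations at each node, so they should be identical.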

Plotting a scatter over a dataset

I have a dataset from which I intend to make a scatterplot. It consists of 2 columns, where the first column should be used as x and the other as y, so that each dot is (first column, second column) of one row.
However, I keep getting "x and y must be same size", and I cannot work out how to plot this. Below is my latest attempt at making them the same size, which was unsuccessful:
import numpy as np
import matplotlib.pyplot as plt
X = np.loadtxt('data')
x = range(len(X))
plt.scatter(x,X, color='blue', label = "car")
plt.show()
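A likely cause of the error is that x = range(len(X)) has one entry per row, while X still holds two values per row, so the two arguments differ in size. A minimal sketch of the usual fix is to slice the two columns out of the loaded array. The in-memory text below is a made-up stand-in for the 'data' file, which is assumed to contain two whitespace-separated columns.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the 'data' file: two whitespace-separated columns (made-up values)
raw = io.StringIO("1.0 2.0\n2.0 3.5\n3.0 1.5\n")
X = np.loadtxt(raw)  # shape (3, 2)

x, y = X[:, 0], X[:, 1]  # first column as x, second column as y
fig, ax = plt.subplots()
ax.scatter(x, y, color='blue', label='car')
ax.legend()
```

With a real file, np.loadtxt('data', unpack=True) returns the columns directly as x, y.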

Plotting a chart with text data on the Y axis and numeric data on the X axis from a dictionary. Matplotlib & Python 3 [duplicate]

I can create a simple bar chart in matplotlib from a 'simple' dictionary:
import matplotlib.pyplot as plt
D = {u'Label1':26, u'Label2': 17, u'Label3':30}
plt.bar(range(len(D)), list(D.values()), align='center')
plt.xticks(range(len(D)), list(D.keys()))
plt.show()
But I do not know how to draw a line plot from the text and numeric data of this dictionary:
T_OLD = {'10': 'need1', '11': 'need2', '12': 'need1', '13': 'need2', '14': 'need1'}
Like the picture below
You may use numpy to convert the dictionary to an array with two columns, which can be plotted.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
x = list(zip(*T_OLD.items()))
# sort the array by key, since the dictionary is not sorted
x = np.array(x)[:,np.argsort(x[0])].T
# let the second column be 1 if "need2", else 0
x[:,1] = (x[:,1] == "need2").astype(int)
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks([0,1])
plt.gca().set_yticklabels(['need1', 'need2'])
plt.show()
The following version is independent of the actual content of the dictionary; the only assumption is that the keys can be converted to floats.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10': 'run', '11': 'tea', '12': 'mathematics', '13': 'run', '14' :'chemistry'}
x = np.array(list(zip(*T_OLD.items())))
u, ind = np.unique(x[1,:], return_inverse=True)
x[1,:] = ind
x = x.astype(float)[:,np.argsort(x[0])].T
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks(range(len(u)))
plt.gca().set_yticklabels(u)
plt.show()
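The label-to-integer mapping done above with np.unique can also be sketched with pandas, which is not what the answer uses but may read more easily. pd.factorize returns the integer codes and the unique labels in one call; the conversion of the keys to float is the same assumption as above.

```python
import pandas as pd

T_OLD = {'10': 'run', '11': 'tea', '12': 'mathematics', '13': 'run', '14': 'chemistry'}

s = pd.Series(T_OLD)
s.index = s.index.astype(float)  # assumes the keys can be converted to floats
s = s.sort_index()

# codes[i] is the position of s.iloc[i] in the sorted list of unique labels
codes, uniques = pd.factorize(s, sort=True)
```

The plot is then plt.plot(s.index, codes), with set_yticks(range(len(uniques))) and set_yticklabels(uniques) on the axes as before.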
Use numeric values for your y-axis ticks, and then map them to desired strings with plt.yticks():
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice([0,1], size=len(times))
data_labels = ['need1','need2']
fig, ax = plt.subplots()
ax.plot(times, data, marker='o', linestyle="None")
# one tick per category (0 and 1), not one per data point
plt.yticks([0, 1], data_labels)
plt.xlabel("time")
Note: It's generally not a good idea to use a line graph to represent categorical changes in time (e.g. from need1 to need2). Doing that gives the visual impression of a continuum between time points, which may not be accurate. Here, I changed the plotting style to points instead of lines. If for some reason you need the lines, just remove linestyle="None" from the call to ax.plot().
UPDATE
(per comments)
To make this work with a y-axis category set of arbitrary length, use ax.set_yticks() and ax.set_yticklabels() to map to y-axis values.
For example, given a set of potential y-axis labels, labels, let N be the size of a subset of labels (here we'll set it to 4, but it could be any size).
Then draw a random sample, data, of y values and plot it against time, labeling the y-axis ticks based on the full set labels. Note that we still use set_yticks() first with numerical markers, and then replace them with our category labels using set_yticklabels().
labels = np.array(['A','B','C','D','E','F','G'])
N = 4
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice(np.arange(N), size=len(times))  # values drawn from the first N labels only
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(times, data, marker='o', linestyle="None")
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
plt.xlabel("time")
This gives the exact desired plot:
import matplotlib.pyplot as plt
from collections import OrderedDict
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
plt.plot(map(int, T_SRT.keys()), map(lambda x: int(x[-1]), T_SRT.values()),'r')
plt.ylim([0.9,2.1])
ax = plt.gca()
ax.set_yticks([1,2])
ax.set_yticklabels(['need1', 'need2'])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
For Python 3.X the plotting line needs to explicitly convert the map() output to lists:
plt.plot(list(map(int, T_SRT.keys())), list(map(lambda x: int(x[-1]), T_SRT.values())),'r')
as in Python 3.X map() returns an iterator as opposed to a list in Python 2.7.
The plot uses the dictionary keys converted to ints and the last characters of need1 or need2, also converted to ints. This relies on the particular structure of your data; if the values were need1 and need3, it would need a couple more operations.
After plotting and changing the axes limits, the program simply modifies the tick labels at y positions 1 and 2. It then also adds the title and the x and y axis labels.
The important part is that the dictionary/input data has to be sorted. One way to do it is to use OrderedDict. Here T_SRT is an OrderedDict object sorted by the keys in T_OLD.
The output is: (image not included)
This is a more general case for more values/labels in T_OLD. It assumes that the label is always 'needX' where X is any number. This can readily be extended to any string preceding the number, though it would require more processing:
import matplotlib.pyplot as plt
from collections import OrderedDict
import re
T_OLD = {'10' : 'need1', '11':'need8', '12':'need11', '13':'need1','14':'need3'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
x_val = list(map(int, T_SRT.keys()))
y_val = list(map(lambda x: int(re.findall(r'\d+', x)[-1]), T_SRT.values()))
plt.plot(x_val, y_val,'r')
plt.ylim([0.9*min(y_val),1.1*max(y_val)])
ax = plt.gca()
y_axis = list(set(y_val))
ax.set_yticks(y_axis)
ax.set_yticklabels(['need' + str(i) for i in y_axis])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
This solution finds the number at the end of the label using re.findall, to accommodate the possibility of multi-digit numbers. The previous solution just took the last character of the string because the numbers were single digits. It still assumes that the number for the plotting position is the last number in the string, hence the [-1]. Again, for Python 3.X the map output is explicitly converted to a list, a step that is not necessary in Python 2.7.
The labels are now generated by first selecting the unique y-values using set and then building each label by concatenating the string 'need' with the corresponding integer.
The limits of the y-axis are set to 0.9 of the minimum value and 1.1 of the maximum value. The rest of the formatting is as before.
The result for this test case is: (image not included)
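One caveat about the sorting step: sorting with key=lambda t: t[0] compares the keys as strings, which happens to work for '10' to '14' but not once the keys differ in length. A small sketch with made-up keys that cross a digit boundary shows the difference and a numeric sort key that avoids it.

```python
# Made-up keys, chosen so that string order and numeric order disagree
T_OLD = {'9': 'need1', '10': 'need2', '11': 'need1'}

by_string = sorted(T_OLD.items(), key=lambda t: t[0])       # lexicographic order
by_number = sorted(T_OLD.items(), key=lambda t: int(t[0]))  # numeric order

string_keys = [k for k, _ in by_string]
number_keys = [k for k, _ in by_number]
```

Passing by_number to OrderedDict instead of by_string keeps the rest of the code unchanged.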

Specifying the color Increments of heat-map in python

Is there a way to specify, in Seaborn or Matplotlib, the color increments of a heatmap color scale? For instance, for a dataframe that contains normalized values between 0 and 1, can I specify 100 discrete color increments so that each value is distinguished from the others?
Thank you in advance
There are two principal approaches to discretize a heatmap into n colors:
Supply the data rounded to the n values.
Use a discrete colormap.
The following code shows those two options.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x, y = np.meshgrid(range(15),range(6))
v = np.random.rand(len(x.flatten()))
df = pd.DataFrame({"x":x.flatten(), "y":y.flatten(),"value":v})
df = df.pivot(index="y", columns="x", values="value")
n = 4  # number of discrete colors (an integer, as required for the colormap size)
fig, (ax0, ax, ax2) = plt.subplots(nrows=3)
### original
im0 = ax0.imshow(df.values, cmap="viridis", vmin=0, vmax=1)
ax0.set_title("original")
### Discretize array
arr = np.floor(df.values * n)/n
im = ax.imshow(arr, cmap="viridis", vmin=0, vmax=1)
ax.set_title("discretize values")
### Discretize colormap
# plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap takes the same arguments
cmap = plt.get_cmap("viridis", int(n))
im2 = ax2.imshow(df.values, cmap=cmap, vmin=0, vmax=1 )
ax2.set_title("discretize colormap")
#colorbars
fig.colorbar(im0, ax=ax0)
fig.colorbar(im, ax=ax)
fig.colorbar(im2, ax=ax2, ticks=np.arange(0,1,1./n), )
plt.tight_layout()
plt.show()
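For the 100 increments asked about in the question, the first approach (rounding the data) has one edge case worth spelling out: with plain floor, a value of exactly 1.0 lands in an extra level of its own. A minimal sketch of the arithmetic, with made-up sample values, clips the bin index so that there are exactly n levels.

```python
import numpy as np

n = 100  # number of discrete levels
v = np.array([0.0, 0.004, 0.0149, 0.5, 0.999, 1.0])  # made-up normalized values

# floor(v * n) is the bin index; clip it so that v == 1.0 falls into the last bin
idx = np.minimum(np.floor(v * n), n - 1).astype(int)
levels = idx / n  # the lower edge of each value's bin
```

The levels array can be passed to imshow in place of the np.floor(df.values * n)/n expression above. With np.random.rand the clipping never triggers, since its values are strictly below 1.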
