Bin and count data to normalize Y axis - python-3.x

I have this area and diameter data:
   DIAMETER     AREA
0  3.039085  1230000
1  2.763617  1230000
2  2.052176  1230000
3  9.498093  1230000
4  2.680360  1230000
I want to bin the data by 1 (2-3,3-4 etc.) and count the number of diameters that are in those bins so that it's organized something like this:
2-3  3-4  4-5  5-6  6-7  7-8  8-9  9-10
  3    1    0    0    0    0    0     1
My end goal is to then grab these counts and divide them by the area in order to normalize the counts.
Lastly I will plot the normalized counts (y) by the bins (x).
I tried using pd.cut, but it didn't work.

What you want here is a histogram. This can easily be achieved with the hist method of your pandas DataFrame (which itself uses the hist plotting method from Matplotlib, which in turn uses NumPy's histogram function), for example:
import pandas as pd
from matplotlib import pyplot as plt
# create a version of the data for the example
data = pd.DataFrame({"DIAMETER": [3.039085, 2.763617, 2.052176, 9.498093, 2.680360]})
fig, ax = plt.subplots() # create a figure to plot onto
bins = [2, 3, 4, 5, 6, 7, 8, 9, 10] # set the histogram bin edges
data.hist(
column="DIAMETER", # the column to histogram
bins=bins, # set the bin edges
density=True, # normalise so that the histogram integrates to 1
ax=ax, # the Matplotlib axis onto which to plot
)
fig.show() # show the plot
This gives a histogram of the diameters, where the pandas hist function automatically adds a plot title based on the column name.
If you don't specify the bins keyword, it will automatically generate 10 equal-width bins spanning the range of your data, but their edges will not necessarily fall on integers. To ensure integer-spaced bins for arbitrary data, you could set the bins with:
import numpy as np
bins = np.arange(
np.floor(data["DIAMETER"].min()),
np.ceil(data["DIAMETER"].max()) + 1,
)
If you want a plot like the one suggested in your other post, then purely with NumPy and Matplotlib you could do:
import numpy as np
from matplotlib import pyplot as plt
# set bins
bins = np.arange(
np.floor(data["DIAMETER"].min()) - 1, # subtract 1, so we have a zero count bin at the start
np.ceil(data["DIAMETER"].max()) + 2, # add 2, so we have a zero count bin at the end
)
# generate histogram
counts, _ = np.histogram(
data["DIAMETER"],
bins=bins,
density=True, # normalise histogram
)
dx_2 = 0.5 * (bins[1] - bins[0]) # bin half spacing (we know this is 1/2, but let's calculate it in case you change it!)
# plot
fig, ax = plt.subplots()
ax.plot(
bins[:-1] + dx_2,
counts,
marker="s", # square marker like the plot from your other question
)
ax.set_xlabel("Diameter")
ax.set_ylabel("Probability density")
fig.show()
which gives a line plot of the probability density against the bin centres.
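Note that density=True rescales the histogram so that it integrates to 1, which is not the same as dividing the counts by AREA. If you want the raw counts per bin divided by the area, as described in the question, a minimal sketch could look like the following. It assumes that AREA is the same for every row (as in the sample data) and uses np.histogram rather than pd.cut; the names labels, area and normalized are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
data = pd.DataFrame({
    "DIAMETER": [3.039085, 2.763617, 2.052176, 9.498093, 2.680360],
    "AREA": [1230000] * 5,
})

bins = np.arange(2, 11)  # bin edges 2, 3, ..., 10
counts, _ = np.histogram(data["DIAMETER"], bins=bins)  # raw counts per bin

# Human-readable bin labels such as "2-3"
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

# Assumption: a single common area applies to all rows
area = data["AREA"].iloc[0]
normalized = pd.Series(counts / area, index=labels, name="count_per_area")

print(pd.Series(counts, index=labels).to_string())
print(normalized.to_string())
```

The resulting Series can then be plotted against the bin labels, for example with normalized.plot.bar().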

Related

How to set axis ticks with non periodical increment in matplotlib

I have a 2D array representing the efficiency of a process for a given set of parameters A and B. The parameter A along the columns changes uniformly, running from 0 to 225 in steps of one. The problem is with the rows, where the parameter was changed in the following order:
[16 ,18 ,20 ,21 ,22 ,23 ,24 ,25 ,26 ,27 ,28 ,29 ,30 ,31 ,32 ,33 ,35 ,40 ,45 ,50 ,55 ,60 ,65 ,70 ,75 ,80 ,85 ,90 ,95 ,100 ,105 ,110 ,115 ,120 ,125]
So even though the rows increase with increment one, they represent a non-uniform increment of the parameter B. What I need is to showcase the values of the parameter B on the y-axis. Using axes.set_yticks() does not give me what I am looking for, and I do understand why but I do not know how to solve it.
A minimum example:
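import numpy as np
import matplotlib.pyplot as plt
# Stand-in data with the same shape as the real array (the original x is not shown)
x = np.random.random((35, 225))
# Define parameter B values
parb_increment = [16, 18, 20] + list(range(21,34)) + list(range(35,126,5))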
print(len(parb_increment))
print(x.shape)
# Figure and axes
figure, axes = plt.subplots(figsize=(10, 8))
# Plotting
im = axes.imshow(x, aspect='auto',
origin="lower",
cmap='Blues',
interpolation='none',
extent=(0, x.shape[1], 0, parb_increment[-1]))
# Unsuccessful trial for yticks
axes.set_yticks(parb_increment, labels=parb_increment)
# Colorbar
cb = figure.colorbar(im, ax=axes)
The previous code gives the figure and output below, and you can see how the ticks are not only misplaced but also start from an incorrect position.
35
(35, 225)
The item that controls the width/height of each pixel is aspect. Unfortunately you can't make it variable. The aspect won't change even if you modify/update y-axis ticks. That's why in your example ticks are mis-aligned with the rows of pixels.
Therefore, the solution to your problem is to duplicate those rows that increment non-uniformly.
See example below:
import numpy as np
import matplotlib.pyplot as plt
# Generate fake data
x = np.random.random((3, 4))
# Create uniform x-ticks and non-uniform y-ticks
x_increment = np.arange(0, x.shape[1]+1, 1)
y_increment = np.arange(0, x.shape[0]+1, 1) * np.arange(0, x.shape[0]+1, 1)
# Plot the data
fig, ax = plt.subplots(figsize=(6, 10))
img = ax.imshow(
x,
extent=(
0, x.shape[1], 0, y_increment[-1]
)
)
fig.colorbar(img, ax=ax)
ax.set_xlim(0, x.shape[1])
ax.set_xticks(x_increment)
ax.set_ylim(0, y_increment[-1])
ax.set_yticks(y_increment);
This replicates your problem and produces the following outcome.
The solution
First, determine the number of repeats of each row in the array:
nr_of_repeats_per_row = np.diff(y_increment)
nr_of_repeats_per_row = nr_of_repeats_per_row[::-1]
You need to reverse the order because, with imshow's default origin ("upper"), the top row in the image is the first row in the array, whereas np.diff(y_increment) lists the row heights starting from the bottom of the y-axis, i.e. from the last row in the array.
Now you can repeat each row in the array a specific number of times:
x_extended = np.repeat(x, nr_of_repeats_per_row, axis=0)
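To see what the repeat step does, here is a small standalone check. It uses an array with recognisable rows instead of random data; the name repeats is only illustrative.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)        # small array with recognisable rows
y_increment = np.array([0, 1, 4, 9])   # same tick positions as in the example
repeats = np.diff(y_increment)[::-1]   # heights 1, 3, 5 reversed to 5, 3, 1
x_extended = np.repeat(x, repeats, axis=0)

print(repeats)           # [5 3 1]
print(x_extended.shape)  # (9, 4)
print(x_extended[:, 0])  # [0 0 0 0 0 4 4 4 8]
```

The first array row is repeated five times, so it fills the tallest band at the top of the image.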
Replot with the x_extended:
fig, ax = plt.subplots(figsize=(6, 10))
img = ax.imshow(
x_extended,
extent=(
0, x.shape[1], 0, y_increment[-1]
),
interpolation="none"
)
fig.colorbar(img, ax=ax)
ax.set_xlim(0, x.shape[1])
ax.set_xticks(x_increment)
ax.set_ylim(0, y_increment[-1])
ax.set_yticks(y_increment);
And you should get this.
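As an alternative to duplicating rows (this is a different technique from the one above, not part of the original answer), Matplotlib's pcolormesh accepts non-uniform cell edges directly. A minimal sketch, assuming you can supply one more edge than there are rows; x_edges and y_edges are illustrative names:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.random((3, 4))            # same fake data shape as above
x_edges = np.arange(0, x.shape[1] + 1)  # uniform column edges: 0, 1, 2, 3, 4
y_edges = np.array([0, 1, 4, 9])        # non-uniform row edges, one more than rows

fig, ax = plt.subplots(figsize=(6, 10))
# Row 0 of x is drawn between y_edges[0] and y_edges[1], i.e. at the bottom
mesh = ax.pcolormesh(x_edges, y_edges, x)
fig.colorbar(mesh, ax=ax)
ax.set_xticks(x_edges)
ax.set_yticks(y_edges)
```

Unlike imshow with its default origin, pcolormesh puts the first array row at the bottom, so pass x[::-1] if you need the same orientation as the imshow example.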

How do I plot the dataframe with only 2 columns(text and int)?

index reviews label
0 0 i admit the great majority of... 1
1 1 take a low budget inexperienced ... 0
2 2 everybody has seen back to th... 1
3 3 doris day was an icon of b... 0
4 4 after a series of silly fun ... 0
I have a dataframe of movie reviews, and I've predicted the label column (1 = positive, 0 = negative review) using kmeans.labels_. How do I visualise/plot the above?
Desired output: scatter plot of 1's and 0's
Code tried :
colors = ['red', 'blue']
pred_colors = [colors[label] for label in km.labels_]
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x='index',y='label',c=pred_colors)
Output: Plot with a red dot at center
This plot comes from:
http://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/Python4_DataAnalysis.html
You do not have values to plot on the x-axis, so we can simply use the index.
The reviews could be added to data as another column.
import pandas as pd
from matplotlib import pyplot as plt
data = [1,0,1,0,0]
df = pd.DataFrame(data, index=range(5), columns=['label'])
#
# line plot
#df.reset_index().plot(x='index', y='label') # turn index into column for plotting on x-axis
#
# scatter plot
ax1 = df.reset_index().plot.scatter(x='index', y='label', c='DarkBlue')
#
plt.tight_layout() # helps prevent labels from being cropped
plt.show()
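If you also want the points coloured by label, as in the attempt from the question, a minimal sketch could look like this. The likely reason the original attempt showed a single dot is that the strings 'index' and 'label' were passed to plt.scatter as data rather than as column names; here the actual values are passed instead. The small DataFrame is a stand-in for the real predictions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"label": [1, 0, 1, 0, 0]})  # stand-in for the predicted labels
colors = ["red", "blue"]                       # colour for label 0 and for label 1
point_colors = [colors[label] for label in df["label"]]

fig, ax = plt.subplots()
ax.scatter(df.index, df["label"], c=point_colors)  # pass values, not column names
ax.set_xlabel("index")
ax.set_ylabel("label")
ax.set_yticks([0, 1])  # only the two label values are meaningful on the y-axis
```

Call plt.show() afterwards to display the figure.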

Seaborn barplot with two y-axis

Consider the following pandas DataFrame:
  labels  values_a  values_b  values_x  values_y
0  date1         1         3       150       170
1  date2         2         6       200       180
It is easy to plot this with Seaborn (see example code below). However, due to the big difference between values_a/values_b and values_x/values_y, the bars for values_a and values_b are not easily visible (the dataset given above is just a sample; in my real dataset the difference is even bigger). Therefore, I would like to use two y-axes, i.e., one y-axis for values_a/values_b and one for values_x/values_y. I tried to use plt.twinx() to get a second axis, but unfortunately the plot shows only two bars, for values_x and values_y, even though there are now two y-axes with the right scaling. :) Do you have an idea how to fix that and get four bars for each label, where the values_a/values_b bars relate to the left y-axis and the values_x/values_y bars relate to the right y-axis?
Thanks in advance!
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
# working example but with unreadable values_a and values_b
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted)
plt.show()
# values_a and values_b are not displayed
values1_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_a", "values_b"],\
var_name="source1", value_name="value_numbers1")
values2_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_x", "values_y"],\
var_name="source2", value_name="value_numbers2")
g1 = sns.barplot(x=columns[0], y="value_numbers1", hue="source1",\
data=values1_melted)
ax2 = plt.twinx()
g2 = sns.barplot(x=columns[0], y="value_numbers2", hue="source2",\
data=values2_melted, ax=ax2)
plt.show()
This is probably best suited for multiple sub-plots, but if you are truly set on a single plot, you can scale the data before plotting, create another axis and then modify the tick values.
Sample Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
Code:
# Scale the data, just a simple example of how you might determine the scaling
mask = test_data_melted.source.isin(['values_a', 'values_b'])
scale = int(test_data_melted[~mask].value_numbers.mean()
/test_data_melted[mask].value_numbers.mean())
test_data_melted.loc[mask, 'value_numbers'] = test_data_melted.loc[mask, 'value_numbers']*scale
# Plot
fig, ax1 = plt.subplots()
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted, ax=ax1)
# Create a second y-axis with the scaled ticks
ax1.set_ylabel('X and Y')
ax2 = ax1.twinx()
# Put ticks at the same positions as on ax1, then relabel them with the unscaled values
ax2.set_yticks(ax1.get_yticks())
ax2.set_ylim(ax1.get_ylim())
ax2.set_yticklabels(np.round(ax1.get_yticks()/scale,1))
ax2.set_ylabel('A and B')
plt.show()
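A variation on the same idea, sketched here as an alternative rather than as part of the original answer: Matplotlib's secondary_yaxis can derive the right-hand axis from a pair of conversion functions, so the tick labels do not have to be rewritten by hand. To stay self-contained, the sketch draws the bars with pandas plotting instead of Seaborn; the names small, large and plot_data are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

test_data = pd.DataFrame({
    "labels": ["date1", "date2"],
    "values_a": [1, 2], "values_b": [3, 6],
    "values_x": [150, 200], "values_y": [170, 180],
})
small, large = ["values_a", "values_b"], ["values_x", "values_y"]

# Same scaling rule as above: ratio of the group means, truncated to an int
scale = int(test_data[large].to_numpy().mean() / test_data[small].to_numpy().mean())

plot_data = test_data.set_index("labels").astype(float)
plot_data[small] = plot_data[small] * scale  # stretch the small columns

fig, ax1 = plt.subplots()
plot_data.plot.bar(ax=ax1, rot=0)
ax1.set_ylabel("X and Y")

# Right-hand axis showing the unscaled A/B values (forward, inverse conversion)
ax2 = ax1.secondary_yaxis(
    "right",
    functions=(lambda v: v / scale, lambda v: v * scale),
)
ax2.set_ylabel("A and B")
```

Because the secondary axis is tied to ax1 through the two functions, it stays consistent if the limits of ax1 change later.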

Matplotlib sliding window not plotting correctly

I have a code that runs a rolling window (30) average over a range (i.e. 300)
So I have 10 averages but they plot against ticks 1-10 rather than spaced over every window of 30.
The only way I can get it to look right is to plot it over (len(windowlength)) but the x-axis isn't right.
Is there any way to manually space the results?
windows30 = (sliding_window(sequence, 30))
Overall_Mean = mean(sequence)
fig, (ax) = plt.subplots()
plt.subplots_adjust(left=0.07, bottom=0.08, right=0.96, top=0.92, wspace=0.20, hspace=0.23)
ax.set_ylabel('mean (%)')
ax.set_xlabel(' Length') # axis titles
ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
ax.plot(windows30, color='r', marker='o', markersize=3)
ax.plot([0, len(sequence)], [Overall_Mean, Overall_Mean], lw=0.75)
plt.show()
From what I understand, you have a list of length 300 that holds only 10 values. If that is the case, you can remove the other values, which are None, from your windows30 list using the following solution.
Code Demonstration:
import numpy as np
import random
import matplotlib.pyplot as plt
# Generating the list of Nones and numbers
listofzeroes = [None] * 290
numbers = random.sample(range(50), 10)
numbers.extend(listofzeroes)
# Removing Nones from the list
numbers = [value for value in numbers if value is not None]
step = len(numbers)
x_values = np.linspace(0,300,step) # Generate x-values
plt.plot(x_values,numbers, color='red', marker='o')
This is a working example; the relevant code for you is after the second comment.
Output:
The above code will work independently of where the Nones are located in your list. I hope this solves your problem.
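If windows30 instead holds exactly one mean per window, another option is to compute the x position of each mean explicitly. The sketch below assumes non-overlapping windows of 30 over a sequence of 300 values (the sliding_window function from the question is not shown, so the means are recomputed here with NumPy) and places each mean at the centre of its window. The names window_means and window_centres are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

window = 30
sequence = np.random.default_rng(0).random(300) * 100  # stand-in data

# One mean per non-overlapping block of `window` values
n_windows = len(sequence) // window
window_means = sequence[: n_windows * window].reshape(n_windows, window).mean(axis=1)
# x position of each mean: the centre of its window (15, 45, ..., 285)
window_centres = np.arange(n_windows) * window + window / 2

fig, ax = plt.subplots()
ax.plot(window_centres, window_means, color="r", marker="o", markersize=3)
ax.plot([0, len(sequence)], [sequence.mean(), sequence.mean()], lw=0.75)
ax.set_xlabel("Length")
ax.set_ylabel("mean (%)")
```

This keeps the x-axis in units of the original sequence, so the overall-mean line and the window means line up.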

Second y-axis and overlapping labeling?

I am using Python for a simple time-series analysis of calorie intake. I am plotting the time series and the rolling mean/std over time. It looks like this:
Here is how I do it:
## packages & libraries
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import Series, DataFrame # Panel has been removed from pandas
## import data and set time series structure
data = pd.read_csv('time_series_calories.csv', parse_dates={'dates': ['year','month','day']}, index_col=0)
## check ts for stationarity
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    # Determine rolling statistics (pd.rolling_mean / pd.rolling_std were removed from pandas)
    rolmean = timeseries.rolling(window=14).mean()
    rolstd = timeseries.rolling(window=14).std()
    # Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
The plot doesn't look good, since the rolling std distorts the scale of variation and the x-axis labelling is screwed up. I have two questions: (1) How can I plot the rolling std on a second y-axis? (2) How can I fix the overlapping x-axis labels?
EDIT
With your help I managed to get the following:
But how do I get the legend sorted out?
1) Making a second (twin) axis can be done with ax2 = ax1.twinx(), see here for an example. Is this what you needed?
2) I believe there are several old answers to this question, i.e. here, here and here. According to the links provided, the easiest way is probably to use either plt.xticks(rotation=70) or plt.setp( ax.xaxis.get_majorticklabels(), rotation=70 ) or fig.autofmt_xdate().
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
plt.xticks(rotation=70) # Either this
ax.set_xticks([1, 2, 3, 4, 5])
ax.set_xticklabels(['aaaaaaaaaaaaaaaa','bbbbbbbbbbbbbbbbbb','cccccccccccccccccc','ddddddddddddddddddd','eeeeeeeeeeeeeeeeee'])
# fig.autofmt_xdate() # or this
# plt.setp( ax.xaxis.get_majorticklabels(), rotation=70 ) # or this works
fig.tight_layout()
plt.show()
Answer to Edit
One way to combine lines from different axes into a single legend is to create some fake plots in the axis that should hold the legend:
ax1.plot(something, 'r--') # one plot into ax1
ax2.plot(something_else, 'gx') # another into ax2
# create two empty plots in ax1
ax1.plot([], [], 'r--', label='Line 1 from ax1') # empty fake-plot with same lines/markers as first line you want to put in legend
ax1.plot([], [], 'gx', label='Line 2 from ax2') # empty fake-plot as line 2
ax1.legend()
In my silly example it is probably better to label the original plot in ax1, but I hope you get the idea. The important thing is to create the "legend-plots" with the same line and marker settings as the original plots. Note that the fake-plots will not be plotted since there is no data to plot.
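Putting the pieces together, here is a minimal runnable sketch of the twin-axis plot with a single legend. Instead of fake plots it collects the legend handles from both axes, which is a different but commonly used approach. The series is synthetic stand-in data, not the calorie file from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the calorie series (assumed data)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=120, freq="D")
ts = pd.Series(2000 + rng.normal(0, 150, size=len(dates)), index=dates)

rolmean = ts.rolling(window=14).mean()
rolstd = ts.rolling(window=14).std()

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

ax1.plot(ts.index, ts.values, color="blue", label="Original")
ax1.plot(rolmean.index, rolmean.values, color="red", label="Rolling Mean")
ax2.plot(rolstd.index, rolstd.values, color="black", label="Rolling Std")
ax1.set_ylabel("calories")
ax2.set_ylabel("rolling std")

# Merge the legend entries of both axes into one legend
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
legend = ax1.legend(handles1 + handles2, labels1 + labels2, loc="best")

fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

The legend is attached to ax1 but lists the lines from both axes, so no empty placeholder plots are needed.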
