How to plot a histogram with plot.hist for continuous data in a pandas DataFrame?

In this data set I need to plot pH on the x-axis, which holds continuous data, group the pH values by their quality value, and plot the histogram. Most of the resources I found only show solutions that use randomly generated data. I tried this piece of code:
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
I removed the randomly generated data, but then I received an error:
File "<ipython-input-16-9afc718b5558>", line 1
plt.hist(, density=True, bins=1)
^
SyntaxError: invalid syntax
What is the proper way to plot my data? I want to feed the histogram with data from the data set, not with randomly generated data.

Your Error
The immediate problem in your code is that the plt.hist() call is missing its data argument.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
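As a minimal sketch of the corrected call, assuming a DataFrame named data_table with a numeric pH column (the values below are synthetic stand-ins, not your data), the following draws the histogram and keeps the counts and bin edges that hist returns. It uses 20 bins here, since a single bin collapses everything into one bar:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data set (assumption: a numeric "pH" column)
rng = np.random.default_rng(0)
data_table = pd.DataFrame({"pH": rng.normal(3.3, 0.15, size=200)})

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn bar patches
counts, edges, patches = ax.hist(data_table["pH"], bins=20)
ax.set_xlabel("pH")
ax.set_ylabel("count")
```

Call plt.show() or fig.savefig(...) afterwards to look at the figure.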
Seaborn histplot
But this doesn't break the plot down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn, which works with "melted" (long-format) data like yours. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
This assumes the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality".
The palette (Dark2) can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
                         figsize=[6,12],
                         sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data['pH'], bins=10)
    ax.set_title(f"quality = {quality}")
    ax.set_xlabel('pH')
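pandas can also build the per-group subplots itself. The following is only a sketch, assuming the same column names pH and quality (the data are synthetic), that uses DataFrame.hist with its by argument instead of an explicit loop:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption: columns "pH" and "quality")
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "pH": rng.normal(3.3, 0.15, size=300),
    "quality": rng.choice([3, 4, 5, 6], size=300),
})

# One histogram per quality value, stacked vertically with a shared x-axis
axes = df.hist(column="pH", by="quality", bins=10,
               sharex=True, layout=(4, 1), figsize=(6, 12))
axes = np.ravel(axes)  # flatten, whatever shape pandas returned
for ax in axes:
    ax.set_xlabel("pH")
```

The layout=(4, 1) value matches the four quality levels in this made-up data; it would need adjusting for a different number of groups.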
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
    alt.X("pH:Q", bin=True),
    y='count()',
).facet(row='quality')

There are several ways to represent multiple histograms. All of them require transforming the data from long to wide format, meaning that each category gets its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.
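To make the long-to-wide step concrete, here is a tiny sketch with made-up values (four rows, two quality levels). Each quality value becomes its own column, and rows that belong to another category are filled with NaN:

```python
import pandas as pd

# Made-up long-format data: one row per measurement
long_df = pd.DataFrame({"pH": [3.1, 3.4, 3.2, 3.6],
                        "quality": [5, 6, 5, 6]})

# Wide format: one column per quality value, original row index kept
wide_df = long_df.pivot(columns="quality", values="pH")
print(wide_df)

# The non-missing values in each column are the pH values of that category
per_quality = {int(q): wide_df[q].dropna().tolist() for q in wide_df.columns}
```

The histogram functions then treat each column as a separate data series.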

Related

Plotting multiple datasets in single graph

I have many datasets taken from multiple excel files that I would like to plot on the same graph each with a different color.
I have created 4 spreadsheets with random data for testing.
The first column defines the measurement. The code should select one measurement, which contains 5 rows of data (X, Y), and add those rows to a dataframe. The result should be one dataset per file, all plotted together on the same graph, each in a different color.
I have been using modified pieces of code posted here by people who were trying to do the same thing. The problem is that I cannot color each plot differently, because pd.concat() merges the datasets and the program then treats them as a single line. Do you know how I could overcome this?
Other questions about plotting multiple datasets in a single graph almost all deal with a small number of datasets, while I have about 50, so I cannot create a subplot for each one of them unless there is a way to do this automatically.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
from os import path
import sys
import openpyxl
# create a list of all excel files in the directory
xlsx_files=glob.glob(r'C:\Users\exx762\Desktop\*.xlsx')
files=[]
n=len(xlsx_files)
index=0
# select chunk of data needed from each file and add to dataframe
for file in xlsx_files:
    index+=1
    files.append(pd.read_excel(file))
df_files=pd.concat(files)
ph_loops=df_files[df_files['Measurement']==2]
x = ph_loops['X']
y = ph_loops['Y']
# plot elements in the dataframe
ax=plt.subplot()
colors=plt.cm.jet(np.linspace(0, 1, n))
ax.set_prop_cycle('color', list(colors))
ax.plot(x, y, marker='.', c=colors[index-1], linewidth=0.5, markersize=2)
print(colors[index-1])
ax.tick_params(axis='y', color='k')
ax.set_xlabel('X', fontsize=12, weight='bold')
ax.set_ylabel('Y', fontsize=12, weight='bold')
ax.set_title(file+'\n')
ax.tick_params(width=2)
plt.plot()
plt.show()
You can add an id field (I used name below) to the dataframes as you concatenate them, then you can plot in a loop. Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create example dataframes
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(np.random.randn(10, 2), columns=['x', 'y'])
    df.insert(0, 'name', i)
    dfs.append(df)
result = pd.concat(dfs, ignore_index=True)
# Plot
fig, ax = plt.subplots()
for name, group in result.groupby('name'):
    group.plot(x='x', y='y', ax=ax, label=name)
plt.show()
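Applied to the situation in the question, the same idea might look like the sketch below. The file names and values are hypothetical stand-ins for the spreadsheets; only the column names Measurement, X and Y are taken from the question. Passing a dict to pd.concat keeps the file name as an outer index level, so each file can be plotted as its own line with its own color:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the spreadsheets; with real files this could be
# frames = {path: pd.read_excel(path) for path in xlsx_files}
frames = {
    f"file_{i}.xlsx": pd.DataFrame({
        "Measurement": [1, 1, 2, 2, 2],
        "X": [0, 1, 0, 1, 2],
        "Y": [i, i + 1, i, i + 2, i + 4],
    })
    for i in range(3)
}

# The dict keys become an outer index level named "file"
combined = pd.concat(frames, names=["file"])
selected = combined[combined["Measurement"] == 2]

# One color per file, taken from the jet colormap as in the question
colors = plt.cm.jet(np.linspace(0, 1, len(frames)))
fig, ax = plt.subplots()
for color, (name, group) in zip(colors, selected.groupby(level="file")):
    ax.plot(group["X"], group["Y"], marker=".", color=color, label=name)
ax.legend()
```

With about 50 files the legend gets long, so it may be worth dropping it or moving it outside the axes.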

Seaborn / Matplotlib: Subplots depending on one column

I have a Dataframe and based on its data, I draw lineplots for it.
The code currently looks as simple as that:
ax = sns.lineplot(x='datapoints', y='mean', hue='index', data=df)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
Now, there actually is a column, called "klinger", which has 8 different values and I would like to get a plot consisting of eight subplots (4x2) for it, all sharing just one legend.
Is that an easy thing to do?
Currently, I generate sub-dfs by filtering, draw eight diagrams and stitch them together with a graphics tool, but this can't be the solution.
You can get what you are looking for with sns.relplot and kind='line'.
Use col='klinger' to create one subplot per value. col_wrap=4 gives the 4x2 layout, and col_order=klinger_categories selects which categories to plot and in which order. The resulting figure has a single shared legend.
import numpy as np
import pandas as pd
import seaborn as sns
number = 100
klinger_categories = ['a','b','c','d','e','f','g','h']
data = {'datapoints': np.arange(number),
        'mean': np.random.normal(0,1,size=number),
        'index': np.random.choice(np.arange(2),size=number),
        'klinger': np.random.choice(klinger_categories,size=number),
        }
df = pd.DataFrame(data)
sns.relplot(
    data=df, x='datapoints', y='mean', hue='index', kind='line',
    col='klinger', col_wrap=4, col_order=klinger_categories
)
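If you prefer to stay with plain matplotlib, a rough equivalent is to loop over the groups yourself and add one figure-level legend. This is only a sketch with synthetic data; the column names follow the question, everything else is made up:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data with every klinger/index combination present
rng = np.random.default_rng(2)
klinger_categories = list("abcdefgh")
df = pd.DataFrame([
    {"klinger": k, "index": i, "datapoints": x, "mean": rng.normal()}
    for k in klinger_categories for i in (0, 1) for x in range(20)
])

# 2 rows x 4 columns of subplots, one per klinger value
fig, axes = plt.subplots(2, 4, figsize=(12, 5), sharex=True, sharey=True)
for ax, (klinger, sub_df) in zip(axes.ravel(), df.groupby("klinger")):
    for idx, line_df in sub_df.groupby("index"):
        ax.plot(line_df["datapoints"], line_df["mean"], label=str(idx))
    ax.set_title(f"klinger = {klinger}")

# Every subplot has the same lines, so take the legend entries from the first
handles, labels = axes[0, 0].get_legend_handles_labels()
fig.legend(handles, labels, title="index", loc="center right")
```

The colors only line up across subplots because each one draws the same index values in the same order; with missing groups you would need to assign colors explicitly.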

Plot several boxplots in one figure

I am using python-3.x and I would like to plot several boxplots in one figure. All the data come from one numpy array of shape (100, 301).
If I use the code below, it plots them all (301 boxplots in one figure, which is too many):
fig, ax = plt.subplots()
ax.boxplot(my_data)
plt.show()
I don't want to plot all the data. I just want to plot 10, 15 or 20 (a variable number) of the columns, using a for loop or whatever method works best.
For example, if I plot a boxplot for every 50th column, I will have around 6 boxplots out of 301 in my figure. I tried to use a for loop but had no luck.
Any advice would be much appreciated.
You can use indexing with a variable step to plot every 50th column. To keep the box plots separate and avoid overlapping, specify the position of each box plot with the positions parameter. my_data[:, ::step] gives you the data to plot. Below is an example using some random data.
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
my_data = np.random.randint(0, 20, (100, 301))
step = 50
posit = range(my_data[:, ::step].shape[1])
ax.boxplot(my_data[:, ::step], positions=posit)
plt.show()
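Note that a step of 50 over 301 columns selects columns 0, 50, ..., 300, which is 7 boxplots. If you would rather fix the number of boxplots than the step, one option is to pick evenly spaced column indices with np.linspace. A small sketch, where n_boxes is a hypothetical parameter name:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
my_data = rng.integers(0, 20, size=(100, 301))

n_boxes = 6  # hypothetical: the number of boxplots to draw
# Evenly spaced column indices from the first to the last column
cols = np.linspace(0, my_data.shape[1] - 1, n_boxes, dtype=int)

fig, ax = plt.subplots()
positions = list(range(n_boxes))
ax.boxplot(my_data[:, cols], positions=positions)
# Label each box with the column it came from
ax.set_xticks(positions)
ax.set_xticklabels(cols.tolist())
ax.set_xlabel("column index")
```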

Seaborn barplot with two y-axis

Consider the following pandas DataFrame:
  labels  values_a  values_b  values_x  values_y
0  date1         1         3       150       170
1  date2         2         6       200       180
It is easy to plot this with Seaborn (see example code below). However, due to the big difference between values_a/values_b and values_x/values_y, the bars for values_a and values_b are not easily visible (the dataset given above is just a sample, and in my real dataset the difference is even bigger). Therefore, I would like to use two y-axes, i.e., one y-axis for values_a/values_b and one for values_x/values_y. I tried to use plt.twinx() to get a second axis, but unfortunately the plot shows only two bars, for values_x and values_y, even though there are two y-axes with the right scaling. :) Do you have an idea how to fix that and get four bars for each label, where the values_a/values_b bars relate to the left y-axis and the values_x/values_y bars relate to the right y-axis?
Thanks in advance!
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
# working example but with unreadable values_a and values_b
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted)
plt.show()
# values_a and values_b are not displayed
values1_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_a", "values_b"],\
var_name="source1", value_name="value_numbers1")
values2_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_x", "values_y"],\
var_name="source2", value_name="value_numbers2")
g1 = sns.barplot(x=columns[0], y="value_numbers1", hue="source1",\
data=values1_melted)
ax2 = plt.twinx()
g2 = sns.barplot(x=columns[0], y="value_numbers2", hue="source2",\
data=values2_melted, ax=ax2)
plt.show()
This is probably best suited for multiple sub-plots, but if you are truly set on a single plot, you can scale the data before plotting, create another axis and then modify the tick values.
Sample Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
Code:
# Scale the data, just a simple example of how you might determine the scaling
mask = test_data_melted.source.isin(['values_a', 'values_b'])
scale = int(test_data_melted[~mask].value_numbers.mean()
/test_data_melted[mask].value_numbers.mean())
test_data_melted.loc[mask, 'value_numbers'] = test_data_melted.loc[mask, 'value_numbers']*scale
# Plot
fig, ax1 = plt.subplots()
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted, ax=ax1)
# Create a second y-axis with the scaled ticks
ax1.set_ylabel('X and Y')
ax2 = ax1.twinx()
# Ensure ticks occur at the same positions, then modify labels
ax2.set_yticks(ax1.get_yticks())
ax2.set_ylim(ax1.get_ylim())
ax2.set_yticklabels(np.round(ax1.get_yticks()/scale,1))
ax2.set_ylabel('A and B')
plt.show()
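If you would rather keep the raw values and have two real axes, another option is to skip seaborn for this plot and place the bars yourself with matplotlib. The following is a sketch under that assumption, using the sample data from the question; the bar width and offsets are arbitrary choices:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

test_data = pd.DataFrame({
    "labels": ["date1", "date2"],
    "values_a": [1, 2], "values_b": [3, 6],
    "values_x": [150, 200], "values_y": [170, 180],
})

x = np.arange(len(test_data))  # one slot per label
width = 0.2                    # arbitrary bar width

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
# Left axis: small values, shifted to the left half of each slot
ax1.bar(x - 1.5 * width, test_data["values_a"], width, color="C0", label="values_a")
ax1.bar(x - 0.5 * width, test_data["values_b"], width, color="C1", label="values_b")
# Right axis: large values, shifted to the right half of each slot
ax2.bar(x + 0.5 * width, test_data["values_x"], width, color="C2", label="values_x")
ax2.bar(x + 1.5 * width, test_data["values_y"], width, color="C3", label="values_y")

ax1.set_xticks(x)
ax1.set_xticklabels(test_data["labels"])
ax1.set_ylabel("values_a / values_b")
ax2.set_ylabel("values_x / values_y")

# Merge the legend entries of both axes; draw on ax2 so it is not covered
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax2.legend(h1 + h2, l1 + l2, loc="upper left")
```

Because the offsets do not overlap, the bars of the two axes sit side by side instead of hiding each other.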

Second y-axis and overlapping labeling?

I am using python for a simple time-series analysis of calorie intake. I am plotting the time series and the rolling mean/std over time. It looks like this:
Here is how I do it:
## packages & libraries
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import Series, DataFrame
## import data and set time series structure
data = pd.read_csv('time_series_calories.csv', parse_dates={'dates': ['year','month','day']}, index_col=0)
## check ts for stationarity
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    # Determine rolling statistics
    # (pd.rolling_mean / pd.rolling_std no longer exist in current pandas; use .rolling())
    rolmean = timeseries.rolling(window=14).mean()
    rolstd = timeseries.rolling(window=14).std()
    # Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
The plot doesn't look good, since the rolling std distorts the scale of variation and the x-axis labelling is screwed up. I have two questions: (1) How can I plot the rolling std on a second y-axis? (2) How can I fix the overlapping x-axis labels?
EDIT
With your help I managed to get the following:
But how do I get the legend sorted out?
1) You can make a second (twin) axis with ax2 = ax1.twinx(). Is this what you needed?
2) There are several older answers to this question. According to them, the easiest way is probably to use plt.xticks(rotation=70), plt.setp(ax.xaxis.get_majorticklabels(), rotation=70), or fig.autofmt_xdate().
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
plt.xticks(rotation=70) # Either this
ax.set_xticks([1, 2, 3, 4, 5])
ax.set_xticklabels(['aaaaaaaaaaaaaaaa','bbbbbbbbbbbbbbbbbb','cccccccccccccccccc','ddddddddddddddddddd','eeeeeeeeeeeeeeeeee'])
# fig.autofmt_xdate() # or this
# plt.setp( ax.xaxis.get_majorticklabels(), rotation=70 ) # or this works
fig.tight_layout()
plt.show()
Answer to Edit
One way to collect lines from different axes in one legend is to create some fake plots in the axis that should carry the legend:
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot([1, 2, 3], 'r--') # one plot into ax1
ax2.plot([3, 1, 2], 'gx') # another into ax2
# create two empty plots into ax1
ax1.plot([], [], 'r--', label='Line 1 from ax1') # empty fake-plot with same lines/markers as first line you want to put in legend
ax1.plot([], [], 'gx', label='Line 2 from ax2') # empty fake-plot as line 2
ax1.legend()
plt.show()
In my silly example it is probably better to label the original plot in ax1, but I hope you get the idea. The important thing is to create the "legend-plots" with the same line and marker settings as the original plots. Note that the fake-plots will not be plotted since there is no data to plot.
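An alternative to fake plots is to keep the line handles that plot returns and pass them to a single legend call. The sketch below puts both parts of the question together under some assumptions: the calorie series is synthetic, the 14-day window is taken from the question, and the rolling std goes on a twin axis:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily series standing in for the calorie data
rng = np.random.default_rng(4)
dates = pd.date_range("2020-01-01", periods=120, freq="D")
timeseries = pd.Series(2000 + rng.normal(0, 150, size=120), index=dates)

rolmean = timeseries.rolling(window=14).mean()
rolstd = timeseries.rolling(window=14).std()

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis for the rolling std

# plot returns a list of lines; collect them across both axes
lines = ax1.plot(timeseries.index, timeseries.values, color="blue", label="Original")
lines += ax1.plot(rolmean.index, rolmean.values, color="red", label="Rolling Mean")
lines += ax2.plot(rolstd.index, rolstd.values, color="black", label="Rolling Std")

ax1.set_ylabel("calories")
ax2.set_ylabel("rolling std")
# One legend for all three lines, drawn on the top-most axis
ax2.legend(lines, [line.get_label() for line in lines], loc="best")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```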
