Seaborn / Matplotlib: Subplots depending on one column

I have a DataFrame and draw line plots from its data.
The code is currently as simple as this:
ax = sns.lineplot(x='datapoints', y='mean', hue='index', data=df)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
Now, there is also a column called "klinger" with 8 distinct values, and I would like a single figure with eight subplots (4x2), one per value, all sharing one legend.
Is there an easy way to do that?
Currently I create sub-DataFrames by filtering, draw eight separate diagrams, and stitch them together in a graphics tool, but that can't be the right solution.

You can get this with sns.relplot and kind='line'.
Use col='klinger' to draw one subplot per value of that column. col_wrap=4 wraps the panels after four columns, which gives two rows of four for eight values, and col_order=klinger_categories controls which categories are plotted and in what order.
import numpy as np
import pandas as pd
import seaborn as sns
number = 100
klinger_categories = ['a','b','c','d','e','f','g','h']
data = {'datapoints': np.arange(number),
        'mean': np.random.normal(0,1,size=number),
        'index': np.random.choice(np.arange(2),size=number),
        'klinger': np.random.choice(klinger_categories,size=number),
        }
df = pd.DataFrame(data)
sns.relplot(
    data=df, x='datapoints', y='mean', hue='index', kind='line',
    col='klinger', col_wrap=4, col_order=klinger_categories
)
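If you would rather stay in plain Matplotlib, the same 4x2 layout with one shared legend can be sketched roughly as follows. This is only a sketch under assumptions, not part of the answer above: the random data, the fixed color mapping for the 'index' values, and the figure size are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data with the same columns as in the question
rng = np.random.default_rng(0)
number = 400
klinger_categories = list("abcdefgh")
df = pd.DataFrame({
    "datapoints": np.arange(number),
    "mean": rng.normal(0, 1, size=number),
    "index": rng.choice(np.arange(2), size=number),
    "klinger": rng.choice(klinger_categories, size=number),
})

# Fixed color per 'index' value so every panel uses the same colors (assumed mapping)
colors = {0: "tab:blue", 1: "tab:orange"}

# 2 rows x 4 columns, one panel per 'klinger' value
fig, axes = plt.subplots(2, 4, figsize=(12, 5), sharex=True, sharey=True)
for ax, category in zip(axes.flat, klinger_categories):
    panel = df[df["klinger"] == category]
    for idx, group in panel.groupby("index"):
        ax.plot(group["datapoints"], group["mean"],
                color=colors[idx], label=str(idx))
    ax.set_title(f"klinger = {category}")

# Collect one handle per label across all panels,
# then draw a single figure-level legend instead of eight axes legends
by_label = {}
for ax in axes.flat:
    for handle, label in zip(*ax.get_legend_handles_labels()):
        by_label.setdefault(label, handle)
fig.legend(list(by_label.values()), list(by_label.keys()), title="index",
           loc="upper left", bbox_to_anchor=(1, 1))
```

Because the legend is anchored just outside the figure, saving with fig.savefig(..., bbox_inches="tight") is probably needed to keep it from being cropped.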

Related

Plotting multiple datasets in single graph

I have many datasets, taken from multiple Excel files, that I would like to plot on the same graph, each in a different color.
I have created 4 spreadsheets with random data for testing.
The first column identifies the measurement. The code should select one measurement, which contains 5 rows of (X, Y) data, from each file and add it to a dataframe. The result should be one dataset per file, all plotted together on the same graph, each in a different color.
Spreadsheets
I have been using modified pieces of code from people here who were trying to do the same thing. The problem is that I cannot color each plot differently: pd.concat() merges the datasets, so the program treats them as a single line. Do you know how I could overcome this?
Other questions about plotting multiple datasets in a single graph almost all deal with a small number of datasets, whereas I have about 50, so I cannot create a subplot for each one by hand unless there is a way to do this automatically.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
from os import path
import sys
import openpyxl
# create a list of all excel files in the directory
xlsx_files=glob.glob(r'C:\Users\exx762\Desktop\*.xlsx')
files=[]
n=len(xlsx_files)
index=0
# select chunk of data needed from each file and add to dataframe
for file in xlsx_files:
    index+=1
    files.append(pd.read_excel(file))
df_files=pd.concat(files)
ph_loops=df_files[df_files['Measurement']==2]
x = ph_loops['X']
y = ph_loops['Y']
# plot elements in the dataframe
ax=plt.subplot()
colors=plt.cm.jet(np.linspace(0, 1, n))
ax.set_prop_cycle('color', list(colors))
ax.plot(x, y, marker='.', c=colors[index-1], linewidth=0.5, markersize=2)
print(colors[index-1])
ax.tick_params(axis='y', color='k')
ax.set_xlabel('X', fontsize=12, weight='bold')
ax.set_ylabel('Y', fontsize=12, weight='bold')
ax.set_title(file+'\n')
ax.tick_params(width=2)
plt.plot()
plt.show()
> Actual result
You can add an id field (I used name below) to the dataframes as you concatenate them, then you can plot in a loop. Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create example dataframes
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(np.random.randn(10, 2), columns=['x', 'y'])
    df.insert(0, 'name', i)
    dfs.append(df)
result = pd.concat(dfs, ignore_index=True)
# Plot
fig, ax = plt.subplots()
for name, group in result.groupby('name'):
    group.plot(x='x', y='y', ax=ax, label=name)
plt.show()
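Applied to the question's setup (many files, one selected measurement per file), the same idea might look like the sketch below. It is hedged: fake_sheet() and the file_N.xlsx names are hypothetical stand-ins for pd.read_excel(file) and the real paths, and the data is random. Passing a dict to pd.concat labels every row with its source, so each file can get its own color from the colormap.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def fake_sheet():
    """Hypothetical stand-in for pd.read_excel(file): 3 measurements x 5 rows."""
    return pd.DataFrame({
        "Measurement": np.repeat([1, 2, 3], 5),
        "X": np.tile(np.arange(5), 3),
        "Y": rng.normal(size=15),
    })

# Hypothetical file names mapped to their (fake) contents
sheets = {f"file_{i}.xlsx": fake_sheet() for i in range(1, 5)}

# The dict keys become an index level, which is then turned into a 'source' column
combined = pd.concat(sheets, names=["source", "row"]).reset_index(level="source")
ph_loops = combined[combined["Measurement"] == 2]

# One color per source file, one line per source file
groups = ph_loops.groupby("source")
colors = plt.cm.jet(np.linspace(0, 1, groups.ngroups))
fig, ax = plt.subplots()
for color, (source, group) in zip(colors, groups):
    ax.plot(group["X"], group["Y"], marker=".", color=color,
            linewidth=0.5, markersize=2, label=source)
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend()
```

With around 50 files the legend gets long, so it may be worth dropping ax.legend() or moving it outside the axes.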

How to plot a histogram with plot.hist for continuous data in a dataframe in pandas?

In this data set I need to plot pH, which holds continuous data, on the x-axis, group the values by quality, and plot the histogram. Most of the resources I found only show solutions that use randomly generated data. I tried this piece of code:
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
Here I removed the randomly generated data, but I received an error:
  File "<ipython-input-16-9afc718b5558>", line 1
    plt.hist(, density=True, bins=1)
             ^
SyntaxError: invalid syntax
What is the proper way to plot my data? I want to feed the histogram with data from the data set, not randomly generated data.
Your Error
The immediate problem in your code is that the plt.hist() call is missing its data argument.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
Seaborn histplot
But this doesn't break the plot down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn, which works with long-format ("melted") data like yours. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
Assuming the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality", you get something like:
The palette (Dark2) can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
                         figsize=[6,12],
                         sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data['pH'], bins=10)
    ax.set_title(f"quality = {quality}")
    ax.set_xlabel('pH')
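A shorter route to the same small multiples is pandas' own DataFrame.hist with the by argument, which does the grouping and the subplot creation in one call. The sketch below uses made-up data with "pH" and "quality" columns; the layout and figure size are assumptions chosen to mimic the stacked panels above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: four quality levels with pH shifted by quality
rng = np.random.default_rng(123)
n = 300
df = pd.DataFrame({"quality": rng.choice([3, 4, 5, 6], n)})
df["pH"] = 3 * rng.random(n) + df["quality"]

# One histogram per quality value, stacked vertically with a shared x-axis
axes = df.hist(column="pH", by="quality", bins=10,
               layout=(4, 1), sharex=True, figsize=(6, 12))
for ax in np.ravel(axes):
    ax.set_ylabel("count")
```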
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
    alt.X("pH:Q", bin=True),
    y='count()',
).facet(row='quality')
Several possibilities here to represent multiple histograms. All have in common that the data have to be transformed from long to wide format - meaning, each category is in its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
Sample output:
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.
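To make the remark about the y-axis concrete, here is a small NumPy sketch (with made-up pH-like values) of what a histogram actually computes. With plain counts the bars sum to the number of samples; with density=True the bar areas sum to 1; and with a single bin everything lands in one bar, which says nothing about the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(123)
ph = 3 + 3 * rng.random(300)  # made-up values between 3 and 6

# Plain counts per bin: the y-axis is a frequency
counts, edges = np.histogram(ph, bins=5)

# density=True rescales the counts so that the total bar area is 1
density, _ = np.histogram(ph, bins=5, density=True)
widths = np.diff(edges)
total_area = (density * widths).sum()

# bins=1 puts every sample into one bar
one_bin_counts, _ = np.histogram(ph, bins=1)

print(counts.sum(), round(total_area, 6), one_bin_counts)
```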

python-plotly multiple lines in same graph with same Y axis

I have a csv file that looks like this:
time,price,m1,m2,m3,m4,m5,m6,m7,m8,buy/sell
10.30.01,102,105,100.5,103.5,110,100.9,103.02,111,105.0204,
10.30.02,103,104.5,101,104,110.2,101.4,104.03,110.5,104.5204,
10.30.03,104,104,101.5,104.5,110.4,101.9,105.04,110,104.0204,
10.30.04,105,103.5,102,105,110.6,102.4,106.05,109.5,103.5204,
10.30.05,106,103,102.5,105.5,110.8,102.9,107.06,109,103.0204,
10.30.06,107,102.5,103,106,111,103.4,108.07,108.5,102.5204,
10.30.07,108,102,103.5,106.5,111.2,103.9,109.08,108,102.0204,
10.30.08,109,101.5,104,107,111.4,104.4,110.09,107.5,101.5204,BUY
10.30.09,110,101,104.5,107.5,111.6,104.9,111.1,107,101.0204,
10.30.10,111,100.5,105,108,111.8,105.4,112.11,106.5,100.5204,
10.30.11,112,101,105.5,108.5,112,105.9,113.12,106,101.0204,
10.30.12,113,101.5,106,109,112.2,106.4,114.13,105.5,101.5204,SELL
10.30.13,114,102,106.5,109.5,112.4,106.9,115.14,105,102.0204,
10.30.14,115,102.5,107,110,112.6,107.4,116.15,104.5,102.5204,
10.30.15,116,103,107.5,110.5,112.8,107.9,117.16,104,103.0204,BUY
10.30.16,117,103.5,108,111,113,108.4,118.17,103.5,103.5204,
I want time on the x-axis and price, m1, m2, m3, m4, m5, m6, m7, m8 on the y-axis as line graphs; since they are all in the same range, they can share one y-axis. The buy/sell column should appear in the same graph as a scatter plot. How can I do this with plotly?
Sorry for the simple question (if it is one); I tried a lot but couldn't crack it. Thank you in advance.
A great resource for Scatter plot related questions is Plotly's documentation on scatter plots.
Plotting all of the columns price,m1,m2,m3,m4,m5,m6,m7,m8 can be done by looping through a list, and adding each of these columns as a trace.
Then I would recommend that you draw vertical lines in the Scatter plot for each time with BUY or SELL, by iterating through the non-null entries in the buy/sell column and using a shape to create a vertical line. You can also add an arrow and text pointing to each line using an annotation.
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
df = pd.read_csv("buysell.csv")
fig = go.Figure()
cols = ['price','m1','m2','m3','m4','m5','m6','m7','m8']
for col in cols:
    fig.add_trace(go.Scatter(
        x=df['time'],
        y=df[col],
        name=col
    ))
# overall extremes of the plotted columns; df_max positions the annotations
df_max, df_min = df[cols].max().max(), df[cols].min().min()
# iterate over any rows with 'BUY' or 'SELL'
for index, row in df.dropna(subset=['buy/sell']).iterrows():
    # vertical dotted line spanning the full plot height (yref='paper')
    fig.add_shape(
        type='line',
        x0=row['time'],
        y0=0,
        x1=row['time'],
        y1=1,
        yref='paper',
        line=dict(
            color="red",
            width=1,
            dash="dot",
        )
    )
    # arrow and BUY/SELL text pointing at the line
    fig.add_annotation(
        x=row['time'],
        y=df_max,
        text=row['buy/sell'],
        showarrow=True,
        arrowhead=4,
    )
fig.show()

How to add a regression line for the entire data in seaborn.lmplot?

I'm trying to plot the scatter plot in which each point is colored w.r.t the variable Points. Moreover, I want to add the regression line.
import pandas as pd
import urllib3
import seaborn as sns
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
g = sns.lmplot(
    data=decathlon,
    x="100m", y="Long.jump",
    hue='Points', palette='viridis'
)
It seems to me that there are 2 regression lines, one for each group of the data. This is not what I want. I would like to have a regression line for the entire data. Moreover, how can I hide the legend on the right hand side?
Could you please elaborate on how to do so?
You should not use lmplot unless you need to use a FacetGrid to split your dataset in several subplots.
Since the example that you show does not use any of the functionality provided by FacetGrid, you should instead create your plot using a combination of scatterplot() and regplot():
tips = sns.load_dataset('tips')
ax = sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
sns.regplot(data=tips, x="total_bill", y="tip", scatter=False, ax=ax)
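For the second part of the question, hiding the legend, scatterplot() accepts legend=False. The sketch below applies both points to data shaped like the question's; the values are randomly generated stand-ins (only the column names "100m", "Long.jump", and "Points" come from the question), and the black line color is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the decathlon table
rng = np.random.default_rng(0)
n = 40
sprint = rng.normal(11.0, 0.3, n)
decathlon = pd.DataFrame({
    "100m": sprint,
    "Long.jump": 7.3 - 0.8 * (sprint - 11.0) + rng.normal(0, 0.2, n),
    "Points": rng.integers(7000, 9000, n),
})

# Points colored by 'Points', with the legend switched off
ax = sns.scatterplot(data=decathlon, x="100m", y="Long.jump",
                     hue="Points", palette="viridis", legend=False)
# A single regression line fitted to the entire data set, drawn on the same axes
sns.regplot(data=decathlon, x="100m", y="Long.jump",
            scatter=False, color="black", ax=ax)
```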

MatPlotLib Plot last few items differently

I'm exploring Matplotlib and would like to know whether it is possible to show the last few items in a dataset differently.
Example: if my dataset contains 100 numbers, I want to display the last 5 items in a different color.
So far I could do it for the last record using annotate, but I want to show the last few items dotted in red, as against the blue line.
I could finally achieve this by changing a few things in my code.
Below is what I have done.
Let me know in case there is a better way. :)
import pandas as pd
import matplotlib.pyplot as plt
series_df = pd.read_csv('my_data.csv')
series_df = series_df.fillna(0)
series_df = series_df.sort_values(['Date'], ascending=True)
# Create a new DataFrame, series_df2, holding the last 5 items
series_df2 = series_df.tail(5)
plt.plot(series_df["Date"],series_df["Values"],color="red", marker='+')
plt.plot(series_df2["Date"],series_df2["Values"],color="blue", marker='+')
You should add some minimal code example or a figure with the desired output to make your question clear. It seems you want to highlight some of the last few points with a marker. You can achieve this by calling plot() twice:
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.arange(N)
y = np.random.rand(N)
plt.figure()
plt.plot(x, y)
plt.plot(x[-5:], y[-5:], ls='', c='tab:red', marker='.', ms=10)
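If the last few items should be a dotted red line segment rather than markers on top of the blue line, a possible variant is to split the series and let the two parts share one point so the line stays connected. This is a sketch with random data; N, n_last, and the colors are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

N, n_last = 50, 5
rng = np.random.default_rng(0)
x = np.arange(N)
y = rng.random(N)

fig, ax = plt.subplots()
split = N - n_last  # index of the first of the last n_last points
# Blue solid line up to and including the first highlighted point
ax.plot(x[:split + 1], y[:split + 1], color="tab:blue")
# Last n_last points as a dotted red line with markers
ax.plot(x[split:], y[split:], color="tab:red", linestyle=":", marker=".")
```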
