mini scatter matrix using subplots as a loop - python-3.x

I have a dataset with 25 columns and wanted to examine scatter plots. I first looked at it with
Seaborn scatterplot() but this is too messy and there are too many charts to make sense of it all.
So instead I wanted to iterate a single column over all of the columns.
I created this simple loop:
for col in ds_num.columns:
    plt.figure()
    sns.scatterplot(x='initial_term',y=col,hue='logo_renewal',data=ds_num)
    plt.show()
This worked, but it drew the plots in a single column. I'd like several plots in each row, so I tried this instead:
for idx, col in enumerate(ds_num.columns):
    fig = plt.figure(figsize=(20,16))
    ax[idx+1] = fig.add_subplot(5,5,idx+1)
    sns.scatterplot(x='initial_term',y=col,hue='logo_renewal',data=ds_num,ax=ax[idx])
plt.show()
But now I got TypeError: 'AxesSubplot' object does not support item assignment
Any suggestions? Thanks

Found the answer with the help of subplots:
fig, axs = plt.subplots(5,5,figsize=(20,20))
cols = ds_num.columns
for ax, col in zip(axs.flatten(),cols):
    sns.scatterplot(x='initial_term',y=col,hue='logo_renewal',data=ds_num,ax=ax,legend=False)
plt.tight_layout()
Notice that I removed the legend because it took up too much space; this is of course not mandatory.
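For a column count that does not fill the grid exactly, the same idea can be sketched as below. This is only an illustration under assumptions: it uses plain matplotlib scatter instead of seaborn, and the data dictionary is a hypothetical stand-in for ds_num. The grid size is derived from the number of columns, and any leftover axes are hidden.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for ds_num: 7 numeric columns keyed by name.
rng = np.random.default_rng(0)
data = {f"col_{i}": rng.normal(size=50) for i in range(7)}
x_col = "col_0"

ncols = 3
nrows = math.ceil(len(data) / ncols)  # enough rows to hold every column
fig, axs = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows))

axes = axs.flatten()
for ax, name in zip(axes, data):
    ax.scatter(data[x_col], data[name], s=10)
    ax.set_ylabel(name)

# zip() stops at the shorter input, so leftover axes stay empty; hide them.
for ax in axes[len(data):]:
    ax.set_visible(False)

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to view the result.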

Related

Too Many Indices For Array when using matplotlib

Thank you for taking time to read this question.
I am trying to plot pie charts in one row. The number of pie charts will depend on the result returned.
import matplotlib.pyplot as plt
import numpy as np
fig, axs = plt.subplots(1,len(to_plot_arr))
labels = ['Label1','Label2','Label3','Label4']
pos = 0
for scope in to_plot_arr:
    if data["summary"][scope]["Count"] > 0:
        pie_data = np.array(db_data)
        axs[0,pos].pie(pie_data,labels=labels)
        axs[0,pos].set_title(scope)
        pos += 1
plt.show()
In the code, db_data looks like: [12,75,46,29]
When I execute the code above, I get the following error message:
Exception has occurred: IndexError
too many indices for array: array is 1-dimensional, but 2 were indexed
I've tried searching for what could be causing this problem, but just can't find any solution to it. I'm not sure what is meant by "but 2 were indexed"
I've tried generating a pie chart with:
y = np.array(db_data)
plt.pie(y)
plt.show()
And it generates the pie chart as expected. So, I'm not sure what is meant by "too many indices for array" which array is being referred to and how to resolve this.
Hope you are able to help me with this.
Thank You Again.
Notice that the axs you create with plt.subplots(1,len(to_plot_arr)) has shape (len(to_plot_arr),), i.e. it is a 1-D array, but inside the loop you index it with two indices (axs[0,pos]). NumPy rejects the second index because the array has only one dimension, hence the IndexError.
Here is a fix:
import matplotlib.pyplot as plt
import numpy as np
fig, axs = plt.subplots(1,len(to_plot_arr))
labels = ['Label1','Label2','Label3','Label4']
pos = 0
for scope in to_plot_arr:
    if data["summary"][scope]["Count"] > 0:
        pie_data = np.array(db_data)
        axs[pos].pie(pie_data,labels=labels)
        axs[pos].set_title(scope)
        pos += 1
plt.show()
Cheers.
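The shape difference comes from the squeeze argument of plt.subplots, which defaults to True and drops any grid dimension of length 1. A minimal sketch of the effect, assuming three charts and the sample values from the question; passing squeeze=False keeps the array 2-D, so two-index access works for any grid size:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Default squeeze=True: a single row collapses to a 1-D array of Axes.
fig1, axs1 = plt.subplots(1, 3)
shape_squeezed = axs1.shape

# squeeze=False: always a 2-D array, so axs[row, col] is valid for any grid.
fig2, axs2 = plt.subplots(1, 3, squeeze=False)
shape_2d = axs2.shape

sizes = [12, 75, 46, 29]  # same form as db_data in the question
labels = ["Label1", "Label2", "Label3", "Label4"]
for col in range(axs2.shape[1]):
    axs2[0, col].pie(sizes, labels=labels)
    axs2[0, col].set_title(f"chart {col}")  # hypothetical titles
```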
So, I think this is not technically an answer, because I still don't know what was causing the error, but I found a way to solve my problem while still achieving my desired output.
Firstly, I realised, when I changed:
fig, axs = plt.subplots(1,len(to_plot_arr))
to:
fig, axs = plt.subplots(2,len(to_plot_arr)),
the figure could be drawn. So, I continued to try other variations like (1,2), (2,1) and (1,3), and always found that if nrows or ncols was 1, the error would come up.
Fortunately, the layout I required for my use case had 2 rows: a top row with a single plot spanning both columns, and a bottom row with 2 columns.
So, (2,2) fit my use case very well.
Then I set out to get the top row to span 2 columns and found out that this is best done with GridSpec in Matplotlib. While trying to figure out how to use GridSpec, I came to learn that using add_subplot() would be a better route with more flexibility.
So, my final code looks something like:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec
def make_chart():
    fig = plt.figure()
    fig.set_figheight(8)
    fig.set_figwidth(10)
    # GridSpec is used to specify the grid distribution of the figure
    gs = GridSpec(2,len(to_plot_arr),figure=fig)
    # This allows for the first row to span all the columns
    r1 = fig.add_subplot(gs[0,:])
    tbl = plt.table(
        cellText = summary_data,
        rowLabels = to_plot_arr,
        colLabels = config["Key3"],
        loc ='upper left',
        cellLoc='center'
    )
    tbl.set_fontsize(20)
    tbl.scale(1,3)
    r1.axis('off')
    pos = 0
    for scope in to_plot_arr:
        if data["Key1"][scope][0] > 0:
            pie_data = np.array(data["Key2"][scope])
            # Add a chart at the specified position
            r2 = fig.add_subplot(gs[1,pos])
            r2.pie(pie_data, autopct=make_autopct(pie_data))
            r2.set_title(config["Key3"][scope])
            pos += 1
    fig.suptitle(title, fontsize=24)
    plt.xticks([])
    plt.yticks([])
    fig.legend(labels,loc="center left",bbox_to_anchor=(0,0.25))
    plt.savefig(savefile)
    return filename
This was my first go at using Matplotlib. The learning curve has been steep, but with a little patience and attention to the documentation I was able to complete my task. I'm sure that there are better ways to do what I did. If you know a better way, or know how to explain the error I was encountering, please do add an answer to this.
Thank You!
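For readers who want to try the layout on its own, here is a stripped-down, runnable sketch of the same GridSpec idea. The charts dictionary and the "total" column are made up for illustration and are not the poster's data; the table is drawn with Axes.table on the top axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

# Hypothetical data: one list of slice sizes per chart.
charts = {"A": [3, 1, 2], "B": [5, 5], "C": [1, 2, 3, 4]}

fig = plt.figure(figsize=(10, 8))
gs = GridSpec(2, len(charts), figure=fig)

top = fig.add_subplot(gs[0, :])  # first row spans every column
top.axis("off")
top.table(cellText=[[sum(v)] for v in charts.values()],
          rowLabels=list(charts), colLabels=["total"], loc="center")

bottom = []
for pos, (name, sizes) in enumerate(charts.items()):
    ax = fig.add_subplot(gs[1, pos])  # one cell per pie in the second row
    ax.pie(sizes)
    ax.set_title(name)
    bottom.append(ax)
```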

python-plotly multiple lines in same graph with same Y axis

I have a csv file that looks like this:
time,price,m1,m2,m3,m4,m5,m6,m7,m8,buy/sell
10.30.01,102,105,100.5,103.5,110,100.9,103.02,111,105.0204,
10.30.02,103,104.5,101,104,110.2,101.4,104.03,110.5,104.5204,
10.30.03,104,104,101.5,104.5,110.4,101.9,105.04,110,104.0204,
10.30.04,105,103.5,102,105,110.6,102.4,106.05,109.5,103.5204,
10.30.05,106,103,102.5,105.5,110.8,102.9,107.06,109,103.0204,
10.30.06,107,102.5,103,106,111,103.4,108.07,108.5,102.5204,
10.30.07,108,102,103.5,106.5,111.2,103.9,109.08,108,102.0204,
10.30.08,109,101.5,104,107,111.4,104.4,110.09,107.5,101.5204,BUY
10.30.09,110,101,104.5,107.5,111.6,104.9,111.1,107,101.0204,
10.30.10,111,100.5,105,108,111.8,105.4,112.11,106.5,100.5204,
10.30.11,112,101,105.5,108.5,112,105.9,113.12,106,101.0204,
10.30.12,113,101.5,106,109,112.2,106.4,114.13,105.5,101.5204,SELL
10.30.13,114,102,106.5,109.5,112.4,106.9,115.14,105,102.0204,
10.30.14,115,102.5,107,110,112.6,107.4,116.15,104.5,102.5204,
10.30.15,116,103,107.5,110.5,112.8,107.9,117.16,104,103.0204,BUY
10.30.16,117,103.5,108,111,113,108.4,118.17,103.5,103.5204,
I want time on the x-axis and price,m1,m2,m3,m4,m5,m6,m7,m8 on the y-axis. Since they all share the same range, they should be line graphs on the same y-axis, with the buy/sell column shown in the same graph as a scatter plot. How can I do this with Plotly?
Sorry for the simple question (if it is one); I tried a lot but couldn't crack it. Thank you in advance.
A great resource for Scatter plot related questions is Plotly's documentation on scatter plots.
Plotting all of the columns price,m1,m2,m3,m4,m5,m6,m7,m8 can be done by looping through a list, and adding each of these columns as a trace.
Then I would recommend that you draw vertical lines in the Scatter plot for each time with BUY or SELL, by iterating through the non-null entries in the buy/sell column and using a shape to create a vertical line. You can also add an arrow and text pointing to each line using an annotation.
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
df = pd.read_csv("buysell.csv")
fig = go.Figure()
cols = ['price','m1','m2','m3','m4','m5','m6','m7','m8']
for col in cols:
    fig.add_trace(go.Scatter(
        x=df['time'],
        y=df[col],
        name=col
    ))
# highest value across all plotted columns, used as the y position of the annotations
df_max = df[cols].max().max()
# iterate over any rows with 'BUY' or 'SELL'
for index, row in df.dropna(subset=['buy/sell']).iterrows():
    # dotted vertical line spanning the full plot height (yref='paper')
    fig.add_shape(
        type='line',
        x0=row['time'],
        y0=0,
        x1=row['time'],
        y1=1,
        yref='paper',
        line=dict(
            color="red",
            width=1,
            dash="dot",
        )
    )
    # arrow and text label pointing at the line
    fig.add_annotation(
        x=row['time'],
        y=df_max,
        text=row['buy/sell'],
        showarrow=True,
        arrowhead=4,
    )
fig.show()

Gantt Chart for USGS Hydrology Data with Python?

I have compiled a dataframe that contains USGS streamflow data at several different streamgages. Now I want to create a Gantt chart similar to this. Currently, my data has columns as site names and a date index as rows.
Here is a sample of my data.
The problem with the Gantt chart example I linked is that my data has gaps between the start and end dates that would normally define the horizontal time-lines. Many of the examples I found only account for the start and end date, but not missing values that may be in between. How do I account for the gaps where there is no data (blanks or nan in those slots for values) for some of the sites?
First, I have a plot that shows where the missing data is.
import missingno as msno
msno.bar(dfp)
Now, I want time on the x-axis and a horizontal line on the y-axis that tracks when the sites contain data at those times. I know how to do this the brute force way, which would mean manually picking out the start and end dates where there is valid data (which I made up below).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
df=[('RIO GRANDE AT EMBUDO, NM','2015-7-22','2015-12-7'),
('RIO GRANDE AT EMBUDO, NM','2016-1-22','2016-8-5'),
('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2014-12-10','2015-12-14'),
('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2017-1-10','2017-11-25'),
('RIO GRANDE AT OTOWI BRIDGE, NM','2015-8-17','2017-8-21'),
('RIO GRANDE BLW TAOS JUNCTION BRIDGE NEAR TAOS, NM','2015-9-1','2016-6-1'),
('RIO GRANDE NEAR CERRO, NM','2016-1-2','2016-3-15'),
]
df=pd.DataFrame(data=df)
df.columns = ['A', 'Beg', 'End']
df['Beg'] = pd.to_datetime(df['Beg'])
df['End'] = pd.to_datetime(df['End'])
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(df['A'], dt.date2num(df['Beg']), dt.date2num(df['End']))
How do I make a figure (like the one shown above) with the dataframe I provided as an example? Ideally I want to avoid the brute force method.
Please note: values of zero are considered valid data points.
Thank you in advance for your feedback!
Find date ranges of non-null data
2020-02-12 Edit to clarify logic in loop
df = pd.read_excel('Downloads/output.xlsx', index_col='date')
Make sure the dates are in order:
df.sort_index(inplace=True)
Loop through the data and find the edges of the good data ranges. Get the corresponding index values and the name of the gauge and collect them all in a list:
# Looping feels like defeat. However, I'm not clever enough to avoid it
good_ranges = []
for i in df:
    col = df[i]
    gauge_name = col.name
    # Start of good data block defined by a number preceded by a NaN
    start_mark = (col.notnull() & col.shift().isnull())
    start = col[start_mark].index
    # End of good data block defined by a number followed by a NaN
    end_mark = (col.notnull() & col.shift(-1).isnull())
    end = col[end_mark].index
    for s, e in zip(start, end):
        good_ranges.append((gauge_name, s, e))
good_ranges = pd.DataFrame(good_ranges, columns=['gauge', 'start', 'end'])
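To see what the two masks do, here is a small worked example on a made-up series (the gauge name and values are hypothetical). Because shift() introduces a NaN at the edge of the series, a block that begins on the first row or ends on the last row is still detected, and zeros count as valid data.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=8, freq="D")
col = pd.Series([np.nan, 1.0, 0.0, np.nan, np.nan, 2.0, 3.0, 4.0],
                index=idx, name="gauge_a")

valid = col.notnull()
start_mark = valid & col.shift().isnull()    # valid value, previous one missing
end_mark = valid & col.shift(-1).isnull()    # valid value, next one missing

# Pair each block start with its matching block end.
ranges = list(zip(col[start_mark].index, col[end_mark].index))
# Two blocks: 2020-01-02 to 2020-01-03, and 2020-01-06 to 2020-01-08
```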
Plotting
Nothing new here. Copied pretty much straight from your question:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(good_ranges['gauge'],
                dt.date2num(good_ranges['start']),
                dt.date2num(good_ranges['end']))
fig.tight_layout()
Here's an approach you could use. It's a bit hacky, so perhaps someone else will produce a better solution, but it should produce your desired output. First use Series.where to replace the non-NaN values with an integer, which later determines the position of each line on the y-axis. I do this one column (site) at a time, so that all data belonging to the same site ends up at the same height. If you want to increase the spacing between the lines of the Gantt chart, multiply i by a factor (adding a constant only shifts every line by the same amount); there is an example in the comments in the code block below.
The y-labels and their positions are produced in the data munging steps, so this method will work regardless of the number of columns and will position the labels correctly when you change the spacing described above.
This approach returns a matplotlib Axes and a Figure object, so you can adjust the aesthetics of the chart to suit your purposes (e.g. change the thickness of the lines, the colours, etc.). Link to docs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel('output.xlsx')
dates = pd.to_datetime(df.date)
df.index = dates
df = df.drop('date', axis=1)
new_rows = [df[s].where(df[s].isna(), i) for i, s in enumerate(df, 1)]
# To increase spacing between lines multiply i by a number, e.g.:
# [df[s].where(df[s].isna(), i*3) for i, s in enumerate(df, 1)]
new_df = pd.DataFrame(new_rows)
### Plotting ###
fig, ax = plt.subplots() # Create axes object to pass to pandas df.plot()
ax = new_df.transpose().plot(figsize=(40,10), ax=ax, legend=False, fontsize=20)
list_of_sites = new_df.transpose().columns.to_list() # For y tick labels
y_tick_location = new_df.iloc[:, 0].values # For y tick positions
ax.set_yticks(y_tick_location) # Place ticks in correct positions
ax.set_yticklabels(list_of_sites) # Update labels to site names

Creating 3 pandas series as pie charts in one plot

I'm setting up a figure to display 3 pie charts. Data for the charts come from 3 separate pandas series. I suppose I could merge the series into a df and create subplots via that df but I doubt it's needed.
My current code generates 3 pie charts. But they all overlap. I'm confused about how to arrange them.
S19E_sj = (BDdf.loc[BDdf['GRPCODE'] == 'SJ3219'])['Result'].value_counts()
S19E_ge = (BDdf.loc[BDdf['GRPCODE'] == 'G1932'])['Result'].value_counts()
S19E_jl = (BDdf.loc[BDdf['GRPCODE'] == 'JLG1930'])['Result'].value_counts()
fig, ax = plt.subplots(figsize = (8,6))
S19E_sj.plot.pie()
S19E_ge.plot.pie()
S19E_jl.plot.pie()
Although you did not provide a Minimal, Complete, and Verifiable example, you can try something like this: create a figure containing 3 subplots arranged in a row, and then pass each one to the corresponding pie chart call via the ax argument.
fig, axes = plt.subplots(1, 3, figsize = (8,6))
S19E_sj.plot.pie(ax=axes[0])
S19E_ge.plot.pie(ax=axes[1])
S19E_jl.plot.pie(ax=axes[2])
plt.tight_layout()
plt.show()

matplotlib: autofmt_xdate doesn't work for multiple rows

I am using autofmt_xdate() to get a better-looking date x-axis, like below:
fig, ax = plt.subplots(1,2, figsize=(12, 5))
ax[0].plot(my_df[['my_time']], my_df[['field_A']])
ax[0].set_xlabel('time')
fig.autofmt_xdate()
This works fine. However, if I do two rows like below:
fig, ax = plt.subplots(2,2, figsize=(12, 5))
ax[0][0].plot(my_df[['my_time']], my_df[['field_A']])
ax[0][0].set_xlabel('time')
fig.autofmt_xdate()
Then the labels and ticks of ax[0][0] x-axis disappeared. Any idea what I did wrong? Thanks!
You didn't do anything wrong here. What you see is the expected behaviour of fig.autofmt_xdate().
As the documentation says,
The ticklabels are often long, and it helps to rotate them on the bottom subplot and turn them off on other subplots, as well as turn off xlabels.
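If you want rotated date labels on every row, one possible workaround is to skip fig.autofmt_xdate() and rotate the labels per axes with tick_params. This is only a sketch: the times and values lists are hypothetical stand-ins for my_df['my_time'] and my_df['field_A'], and the 30 degree angle is an arbitrary choice.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical data standing in for my_df['my_time'] and my_df['field_A'].
times = [dt.datetime(2021, 1, 1) + dt.timedelta(days=i) for i in range(30)]
values = list(range(30))

fig, axs = plt.subplots(2, 2, figsize=(12, 5))
axs[0, 0].plot(times, values)
axs[0, 0].set_xlabel("time")

# Rotate the date labels on every subplot instead of calling
# fig.autofmt_xdate(), which hides the x tick labels above the bottom row.
for ax in axs.flat:
    ax.tick_params(axis="x", labelrotation=30)

fig.tight_layout()
```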
