Too Many Indices For Array when using matplotlib - python-3.x

Thank you for taking time to read this question.
I am trying to plot pie charts in one row. The number of pie charts will depend on the result returned.
import matplotlib.pyplot as plt
import numpy as np
fig, axs = plt.subplots(1,len(to_plot_arr))
labels = ['Label1','Label2','Label3','Label4']
pos = 0
for scope in to_plot_arr:
    if data["summary"][scope]["Count"] > 0:
        pie_data = np.array(db_data)
        axs[0,pos].pie(pie_data,labels=labels)
        axs[0,pos].set_title(scope)
        pos += 1
plt.show()
In the code, db_data looks like: [12,75,46,29]
When I execute the code above, I get the following error message:
Exception has occurred: IndexError
too many indices for array: array is 1-dimensional, but 2 were indexed
I've tried searching for what could be causing this problem, but I can't find a solution. I'm not sure what is meant by "but 2 were indexed".
I've tried generating a pie chart with:
y = np.array(db_data)
plt.pie(y)
plt.show()
And it generates the pie chart as expected. So, I'm not sure what is meant by "too many indices for array", which array is being referred to, or how to resolve this.
Hope you are able to help me with this.
Thank You Again.

Notice that the axs you create with plt.subplots(1,len(to_plot_arr)) has shape (len(to_plot_arr),), i.e., it is a 1D array, because plt.subplots drops dimensions of length 1 by default. Inside the loop, axs[0,pos] indexes it with 2 indices, as if it were a 2D array. That conflicts with its actual shape and raises the IndexError.
Here is a fix:
import matplotlib.pyplot as plt
import numpy as np
fig, axs = plt.subplots(1,len(to_plot_arr))
labels = ['Label1','Label2','Label3','Label4']
pos = 0
for scope in to_plot_arr:
    if data["summary"][scope]["Count"] > 0:
        pie_data = np.array(db_data)
        axs[pos].pie(pie_data,labels=labels)
        axs[pos].set_title(scope)
        pos += 1
plt.show()
Cheers.
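If you would rather keep the two-index form axs[0,pos], one option is to pass squeeze=False to plt.subplots, which always returns a 2D array of Axes. The sketch below is only an illustration: the contents of to_plot_arr and db_data are made-up stand-ins for the question's data, and the Agg backend is selected just so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the question's data.
to_plot_arr = ["scope_a", "scope_b", "scope_c"]
db_data = [12, 75, 46, 29]
labels = ["Label1", "Label2", "Label3", "Label4"]

# Default behaviour: the length-1 row dimension is squeezed away.
fig_default, axs_default = plt.subplots(1, len(to_plot_arr))
default_shape = axs_default.shape  # 1D array of Axes

# squeeze=False keeps both dimensions, so axs[0, pos] is valid.
fig, axs = plt.subplots(1, len(to_plot_arr), squeeze=False)
for pos, scope in enumerate(to_plot_arr):
    axs[0, pos].pie(np.array(db_data), labels=labels)
    axs[0, pos].set_title(scope)

print(default_shape, axs.shape)
plt.close("all")
```

This also covers the case where to_plot_arr has a single entry. With the default squeezing, axs would then be a single Axes object rather than an array, and even axs[pos] would fail.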

So, I think this is not technically an answer, because I still don't know what was causing the error, but I found a way to solve my problem while still achieving my desired output.
Firstly, I realised, when I changed:
fig, axs = plt.subplots(1,len(to_plot_arr))
to:
fig, axs = plt.subplots(2,len(to_plot_arr)),
the figure could be drawn. So, I continued to try other variations like (1,2), (2,1), (1,3) and always found that if nrows or ncols was 1, the error would come up.
Fortunately, for my use case, the layout I required was with 2 rows with the first row being one column, spanning 2 and the bottom row being 2 columns.
So, (2,2) fit my use case very well.
Then I set out to get the top row to span 2 columns and found out that this is best done with GridSpec in Matplotlib. While trying to figure out how to use GridSpec, I came to learn that using add_subplot() would be a better route with more flexibility.
So, my final code looks something like:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec
def make_chart():
    fig = plt.figure()
    fig.set_figheight(8)
    fig.set_figwidth(10)
    # Gridspec is used to specify the grid distribution of the figure
    gs = GridSpec(2,len(to_plot_arr),figure=fig)
    # This allows for the first row to span all the columns
    r1 = fig.add_subplot(gs[0,:])
    tbl = plt.table(
        cellText = summary_data,
        rowLabels = to_plot_arr,
        colLabels = config["Key3"],
        loc ='upper left',
        cellLoc='center'
    )
    tbl.set_fontsize(20)
    tbl.scale(1,3)
    r1.axis('off')
    pos = 0
    for scope in to_plot_arr:
        if data["Key1"][scope][0] > 0:
            pie_data = np.array(data["Key2"][scope])
            # Add a chart at the specified position
            r2 = fig.add_subplot(gs[1,pos])
            r2.pie(pie_data, autopct=make_autopct(pie_data))
            r2.set_title(config["Key3"][scope])
            pos += 1
    fig.suptitle(title, fontsize=24)
    plt.xticks([])
    plt.yticks([])
    fig.legend(labels,loc="center left",bbox_to_anchor=(0,0.25))
    plt.savefig(savefile)
    return filename
This was my first go at using Matplotlib. The learning curve has been steep, but with a little patience and attention to the documentation, I was able to complete my task. I'm sure that there are better ways to do what I did. If you know a better way or know how to explain the error I was encountering, please do add an answer to this.
Thank You!

Related

How to increase the size of the figure by percentage but keep the original aspect ratio?

I have the following code to draw a figure
import pandas as pd
import urllib3
import seaborn as sns
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
fig = sns.scatterplot(data = decathlon,
                      x = '100m', y = 'Long.jump',
                      hue = 'Points', palette = 'viridis')
sns.regplot(data = decathlon,
            x = '100m', y = 'Long.jump',
            scatter = False)
I read answers to similar questions, and they use the option plt.figure(figsize=(20,10)). I would like to keep the original aspect ratio (the ratio of width to height), but increase the size of the figure by some percentage so that it looks better.
Could you please elaborate on how to do so?
Update: I forgot to include the line %config InlineBackend.figure_format = 'svg' in the code above. Unfortunately, when I add this line, the answer below does not work.
First, the object returned by scatterplot() is an Axes, not a figure. scatterplot() uses the current axes to draw the plot. If there is no current axes, then matplotlib automatically creates one in the current figure. If there is no current figure, then matplotlib automatically creates a new figure.
The size of this figure is determined by the value in rcParams['figure.figsize']. Therefore, you should create a figure that has the same aspect ratio as defined in this variable before calling your plots.
For instance, the code below creates a figure that's 2x the size of the default figure.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
tips = sns.load_dataset('tips')
fig = plt.figure(figsize= 2 * np.array(plt.rcParams['figure.figsize']))
ax = sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
sns.regplot(data=tips, x="total_bill", y="tip", scatter=False, ax=ax)
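Since the question asks for a percentage, the same idea can be wrapped in a small helper. This is only a sketch: scaled_figsize is a hypothetical name, not part of matplotlib or seaborn, and the Agg backend is selected just so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def scaled_figsize(percent):
    """Return the default figure size scaled to `percent` of its original size.

    Hypothetical helper: both sides are multiplied by the same factor,
    so the aspect ratio of rcParams['figure.figsize'] is preserved.
    """
    width, height = plt.rcParams["figure.figsize"]
    factor = percent / 100
    return (width * factor, height * factor)


# A figure that is 150% of the default size, with the default aspect ratio.
fig, ax = plt.subplots(figsize=scaled_figsize(150))
print(fig.get_size_inches())
```

The resulting ax can then be passed to the seaborn calls via their ax= parameter, as in the code above.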

Gantt Chart for USGS Hydrology Data with Python?

I have compiled a dataframe that contains USGS streamflow data at several different streamgages. Now I want to create a Gantt chart similar to this. Currently, my data has site names as columns and a date index as rows.
Here is a sample of my data.
The problem with the Gantt chart example I linked is that my data has gaps between the start and end dates that would normally define the horizontal time-lines. Many of the examples I found only account for the start and end date, but not missing values that may be in between. How do I account for the gaps where there is no data (blanks or nan in those slots for values) for some of the sites?
First, I have a plot that shows where the missing data is.
import missingno as msno
msno.bar(dfp)
Now, I want time on the x-axis and a horizontal line on the y-axis that tracks when the sites contain data at those times. I know how to do this the brute force way, which would mean manually picking out the start and end dates where there is valid data (which I made up below).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
df=[('RIO GRANDE AT EMBUDO, NM','2015-7-22','2015-12-7'),
    ('RIO GRANDE AT EMBUDO, NM','2016-1-22','2016-8-5'),
    ('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2014-12-10','2015-12-14'),
    ('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2017-1-10','2017-11-25'),
    ('RIO GRANDE AT OTOWI BRIDGE, NM','2015-8-17','2017-8-21'),
    ('RIO GRANDE BLW TAOS JUNCTION BRIDGE NEAR TAOS, NM','2015-9-1','2016-6-1'),
    ('RIO GRANDE NEAR CERRO, NM','2016-1-2','2016-3-15'),
    ]
df=pd.DataFrame(data=df)
df.columns = ['A', 'Beg', 'End']
df['Beg'] = pd.to_datetime(df['Beg'])
df['End'] = pd.to_datetime(df['End'])
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(df['A'], dt.date2num(df['Beg']), dt.date2num(df['End']))
How do I make a figure (like the one shown above) with the dataframe I provided as an example? Ideally I want to avoid the brute force method.
Please note: values of zero are considered valid data points.
Thank you in advance for your feedback!
Find date ranges of non-null data
2020-02-12 Edit to clarify logic in loop
df = pd.read_excel('Downloads/output.xlsx', index_col='date')
Make sure the dates are in order:
df.sort_index(inplace=True)
Loop thru the data and find the edges of the good data ranges. Get the corresponding index values and the name of the gauge and collect them all in a list:
# Looping feels like defeat. However, I'm not clever enough to avoid it
good_ranges = []
for i in df:
    col = df[i]
    gauge_name = col.name
    # Start of good data block defined by a number preceded by a NaN
    start_mark = (col.notnull() & col.shift().isnull())
    start = col[start_mark].index
    # End of good data block defined by a number followed by a NaN
    end_mark = (col.notnull() & col.shift(-1).isnull())
    end = col[end_mark].index
    for s, e in zip(start, end):
        good_ranges.append((gauge_name, s, e))
good_ranges = pd.DataFrame(good_ranges, columns=['gauge', 'start', 'end'])
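To make the edge detection easier to follow, here is the same shift-based logic applied to a single made-up column. The gauge name and values are invented for illustration. Note that the zero counts as valid data, which matches the requirement in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy gauge: two blocks of valid readings separated by a gap.
idx = pd.date_range("2020-01-01", periods=8, freq="D")
col = pd.Series([np.nan, 1.0, 0.0, np.nan, np.nan, 3.0, 4.0, 5.0],
                index=idx, name="gauge_a")

# A block starts at a valid value whose predecessor is missing.
# shift() also puts a NaN before the first row, so a block that
# begins on the very first date is still detected.
start_mark = col.notnull() & col.shift().isnull()
# A block ends at a valid value whose successor is missing
# (shift(-1) puts a NaN after the last row).
end_mark = col.notnull() & col.shift(-1).isnull()

toy_ranges = pd.DataFrame({
    "gauge": col.name,
    "start": col.index[start_mark],
    "end": col.index[end_mark],
})
print(toy_ranges)
```

Each row of toy_ranges is one uninterrupted stretch of data, which is what hlines needs to draw one bar segment.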
Plotting
Nothing new here. Copied pretty much straight from your question:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(good_ranges['gauge'],
                dt.date2num(good_ranges['start']),
                dt.date2num(good_ranges['end']))
fig.tight_layout()
Here's an approach that you could use. It's a bit hacky, so perhaps someone else will produce a better solution, but it should produce your desired output. First use Series.where to replace the non-NaN values with an integer, which later determines the position of the line on the y-axis. I do this one site at a time, so that all data belonging to the same site ends up at the same height. If you want to increase the spacing between the lines of the Gantt chart, you can add a number to i; I've provided an example in the comments in the code block below.
The y-labels and their positions are produced in the data munging steps, so this method will work regardless of the number of columns and will position the labels correctly when you change the spacing described above.
This approach gives you matplotlib Axes and Figure objects, so you can adjust the aesthetics of the chart to suit your purposes (e.g. change the thickness of the lines, colours etc.). Link to docs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel('output.xlsx')
dates = pd.to_datetime(df.date)
df.index = dates
df = df.drop('date', axis=1)
new_rows = [df[s].where(df[s].isna(), i) for i, s in enumerate(df, 1)]
# To increase spacing between lines add a number to i, eg. below:
# [df[s].where(df[s].isna(), i+3) for i, s in enumerate(df, 1)]
new_df = pd.DataFrame(new_rows)
### Plotting ###
fig, ax = plt.subplots() # Create axes object to pass to pandas df.plot()
ax = new_df.transpose().plot(figsize=(40,10), ax=ax, legend=False, fontsize=20)
list_of_sites = new_df.transpose().columns.to_list() # For y tick labels
x_tick_location = new_df.iloc[:, 0].values # For y tick positions
ax.set_yticks(x_tick_location) # Place ticks in correct positions
ax.set_yticklabels(list_of_sites) # Update labels to site names
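The where() call is the core of this trick, so here it is in isolation on a made-up series (the values and the constant 3 are arbitrary). Series.where keeps the entries where the condition is True and replaces the rest, so with isna() as the condition the gaps stay NaN and every real reading becomes the constant.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for one site; zero is a valid measurement.
site = pd.Series([np.nan, 2.5, 0.0, np.nan], name="site_a")

# Keep NaN where data is missing, replace every valid value with 3.
flat = site.where(site.isna(), 3)
print(flat.tolist())
```

When plotted as a line, the constant stretches become horizontal segments and the NaN entries become breaks in the line.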

How to use fill_between utilizing the where parameter

So following a tutorial, I tried to create a graph using the following code:
time_values = [i for i in range(1,100)]
execution_time = [random.randint(0,100) for i in range(1,100)]
fig = plt.figure()
ax1 = plt.subplot()
threshold=[.8 for i in range(len(execution_time))]
ax1.plot(time_values, execution_time)
ax1.margins(x=-.49, y=0)
ax1.fill_between(time_values,execution_time, 1,where=(execution_time>1), color='r', alpha=.3)
This did not work as I got an error saying I could not compare a list and an int.
However, I then tried:
ax1.fill_between(time_values,execution_time, 1)
And that gave me a graph with all area in between the execution time and the y=1 line, filled in. Since I want the area above the y=1 line filled in, with the area below left un-shaded, I created a list called threshold, and populated it with 1 so that I could recreate the comparison. However,
ax1.fill_between(time_values,execution_time, 1,where=(execution_time>threshold))
and
ax1.fill_between(time_values,execution_time, 1)
create the exact same graph, even though the execution times values do go beyond 1.
I am confused for two reasons:
firstly, in the tutorial I was watching, the teacher was able to successfully compare a list and an integer within the fill_between function, why was I not able to do this?
Secondly, why is the where parameter not identifying the regions I want to fill? Ie, why is the graph shading in the areas between the y=1 and the value of the execution time?
The problem is mainly due to the use of Python lists instead of numpy arrays. You can use lists, but then you need to use them throughout the code and build the where mask element by element.
import numpy as np
import matplotlib.pyplot as plt
time_values = list(range(1,100))
execution_time = [np.random.randint(0,100) for _ in range(len(time_values))]
threshold = 50
fig, ax = plt.subplots()
ax.plot(time_values, execution_time)
ax.fill_between(time_values, execution_time, threshold,
                where= [e > threshold for e in execution_time],
                color='r', alpha=.3)
ax.set_ylim(0,None)
plt.show()
Better is the use of numpy arrays throughout. It's not only faster, but also easier to code and understand.
import numpy as np
import matplotlib.pyplot as plt
time_values = np.arange(1,100)
execution_time = np.random.randint(0,100, size=len(time_values))
threshold = 50
fig, ax = plt.subplots()
ax.plot(time_values, execution_time)
ax.fill_between(time_values,execution_time, threshold,
                where=(execution_time > threshold), color='r', alpha=.3)
ax.set_ylim(0,None)
plt.show()
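To see why the comparisons in the question behave the way they do, it helps to look at what they evaluate to outside of matplotlib. The numbers below are made up. The sketch shows that a list compared with a list yields one bool rather than a mask, which would be consistent with the whole region being filled, while a numpy array compared with a number yields the element-wise mask that where expects. The tutorial most likely used numpy arrays, though that is a guess.

```python
import numpy as np

execution_time = [5, 0, 3]  # hypothetical values
threshold = [1, 1, 1]

# list > list is a lexicographic comparison: it yields a single bool,
# decided here by the first pair (5 > 1), not one bool per element.
list_vs_list = execution_time > threshold

# list > int is not defined in Python 3 and raises TypeError.
try:
    execution_time > 1
    list_vs_int_raised = False
except TypeError:
    list_vs_int_raised = True

# A numpy array compares element-wise and produces a boolean mask.
mask = np.array(execution_time) > 1

print(list_vs_list, list_vs_int_raised, mask)
```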

MatPlotLib Plot last few items differently

I'm exploring MatPlotLib and would like to know if it is possible to show last few items in a dataset differently.
Example: If my dataset contains 100 numbers, I want to display last 5 items in different color.
So far I could do it with one last record using annotate, but want to show last few items dotted with 'red' color as against the blue line.
I could finally achieve this by changing few things in my code.
Below is what I have done.
Let me know in case there is a better way. :)
import pandas as pd
import matplotlib.pyplot as plt
series_df = pd.read_csv('my_data.csv')
series_df = series_df.fillna(0)
series_df = series_df.sort_values(['Date'], ascending=True)
# Create a new DataFrame, series_df2, holding the last 5 items
series_df2 = series_df.tail(5)
plt.plot(series_df["Date"],series_df["Values"],color="red", marker='+')
plt.plot(series_df2["Date"],series_df2["Values"],color="blue", marker='+')
You should add some minimal code example or a figure with the desired output to make your question clear. It seems you want to highlight some of the last few points with a marker. You can achieve this by calling plot() twice:
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.arange(N)
y = np.random.rand(N)
plt.figure()
plt.plot(x, y)
plt.plot(x[-5:], y[-5:], ls='', c='tab:red', marker='.', ms=10)

Cant get the legend to show correctly on the chart

My legend is showing in the top right, but rather than stating AAPL and IBM it shows a single letter. I can't figure out what's wrong.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
def get_mean_volume(symbol):
    df = quandl.get("YAHOO/"+str(symbol))[::-1]
    return df[['High', 'Adjusted Close']]
stock = ['AAPL', 'IBM']
for s in stock:
    plt.plot(get_mean_volume(s))
    plt.legend(s)
plt.ylabel('Price')
plt.xlabel('Date')
This is from the matplotlib.pyplot.legend() documentation:
"To make a legend for lines which already exist on the axes (via plot for instance), simply call this function with an iterable of strings, one for each legend item. For example:"
plt.plot([1, 2, 3])
plt.legend(['A simple line'])
You should probably also add a plt.show().
Since you don't use any labels, I think you should use:
plt.legend([s])
You only see one letter because legend() iterates over its input (s = "AAPL") and uses the first item (s[0], which is 'A') as the label text for the first line.
In the second iteration of the loop the same happens with 'I' (because s[0] = 'I' in this case, s[1] = 'B', and so on).
legend() is quite customizable; check the matplotlib docs for details.
So this is the result for me:
import matplotlib.pyplot as plt
stock = ['AAPL']
for s in stock:
    plt.plot([1,2,3])
    plt.legend([s])
plt.ylabel('Price')
plt.xlabel('Date')
plt.show()
This results in a legend that shows the full label AAPL.
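An alternative that scales to several symbols is to give each line its own label when plotting and call legend() once after the loop. The sketch below uses made-up price lists in place of the quandl download, so the prices dictionary and its values are purely hypothetical. The Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical stand-in for the downloaded price series.
prices = {"AAPL": [1, 2, 3], "IBM": [2, 3, 5]}

fig, ax = plt.subplots()
for symbol, values in prices.items():
    # The label travels with the line, so legend() needs no arguments.
    ax.plot(values, label=symbol)
ax.set_ylabel("Price")
ax.set_xlabel("Date")
legend = ax.legend()

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

If one plot call draws several columns at once, as in the question, a single label= string may not be enough, and passing an explicit list of labels to legend() is the safer route.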
