Pandas Dataframe plot not showing dates when matplotlib.dates used - python-3.x

I have the following code that plots COVID-19 confirmed cases country-wise against some dates.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.DataFrame({'Countries': ['Australia', 'India', 'UAE', 'UK'],
'3/1/20': [ 27, 3, 21, 36],
'3/2/20': [ 30, 5, 21, 40],
'3/3/20': [ 39, 5, 27, 51],
'3/4/20': [ 52, 28, 27, 86],
},
index = [0, 1, 2, 3])
print('DataFrame:\n')
print(df)
dft=df.T
print('\n Transposed data:\n')
print(dft)
print(dft.columns)
dft.columns=dft.iloc[0]
dft=dft[1:]
print('\n Final data:\n')
print(dft)
dft.plot.bar(align='center')
# Set date ticks with 2-day interval
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=2))
# Change date format
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%d-%m-%Y'))
''' Note: If I comment above two lines, I get back x-axis ticks. '''
# Autoformatting dates ticks
plt.gcf().autofmt_xdate()
plt.title('COVID-19 confirmed cases')
plt.show()
Here I intended to show the dates on the x-axis ticks with 2-day intervals and get the dates formatted in a different style. However, in the plot, I don't get any ticks and labels on the x-axis as shown in the figure below.
However, when I comment out the instructions with matplotlib.dates, I get back the x-ticks and labels.
Can this be explained and fixed in a simple way? Also, can we get the same result using fig, ax = plt.subplots()?

You were almost there. All you need to do is restructure your dataframe so that the dates are in the index. One way to do this is as follows:
Data
df = pd.DataFrame({'Countries': ['Australia', 'India', 'UAE', 'UK'],
'3/1/20': [ 27, 3, 21, 36],
'3/2/20': [ 30, 5, 21, 40],
'3/3/20': [ 39, 5, 27, 51],
'3/4/20': [ 52, 28, 27, 86],
},
index = [0, 1, 2, 3])
Reshape to long format, one row per country and date
df2=df.set_index('Countries').T.unstack().reset_index()
df2.columns=['Countries','Date','Count']
Coerce date to datetime
df2['Date']=pd.to_datetime(df2['Date'])
df2.dtypes
df2.set_index('Date', inplace=True)
Group by date and country, then unstack before plotting
df2.groupby([df2.index.date,df2['Countries']])['Count'].sum().unstack().plot.bar()
Outcome
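On the "why": pandas draws bar plots on a categorical axis, so the bars sit at integer positions 0, 1, 2, ... and the dates are only text labels. A matplotlib.dates locator or formatter then treats those small integers as date numbers, and the original labels are lost. A minimal sketch of one workaround, which also uses fig, ax = plt.subplots() as asked: format the dates as text before plotting and thin the labels by hand. The names table, plot_cases and step are illustrative and not from the original posts.

```python
import pandas as pd

df = pd.DataFrame({'Countries': ['Australia', 'India', 'UAE', 'UK'],
                   '3/1/20': [27, 3, 21, 36],
                   '3/2/20': [30, 5, 21, 40],
                   '3/3/20': [39, 5, 27, 51],
                   '3/4/20': [52, 28, 27, 86]})

# One row per date, one column per country.
table = df.set_index('Countries').T
# Parse the month/day/year labels, then turn them into the wanted text format.
table.index = pd.to_datetime(table.index, format='%m/%d/%y').strftime('%d-%m-%Y')


def plot_cases(table, step=2):
    """Bar plot that shows every `step`-th date label (hypothetical helper)."""
    # Imported here so the reshaping above can run without a display backend.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    table.plot.bar(ax=ax)
    # Bars are at positions 0..N-1, so hide the labels in between by position.
    for i, label in enumerate(ax.get_xticklabels()):
        label.set_visible(i % step == 0)
    fig.autofmt_xdate()  # rotate and right-align the remaining labels
    ax.set_title('COVID-19 confirmed cases')
    return fig, ax
```

Calling plot_cases(table) followed by plt.show() should give the grouped bars with a label on every second date.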

Related

Python: Plot histograms with customized bins

I am using matplotlib.pyplot to make a histogram. Due to the distribution of the data, I want to manually set up the bins. The details are as follows:
Any value = 0 in one bin;
Any value > 60 in the last bin;
Any value > 0 and <= 60 are in between the bins described above and the bin size is 5.
Could you please give me some help? Thank you.
I'm not sure what you mean by "the bin size is 5". You can either plot a histogram by specifying the bins with a sequence:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
plt.hist(data, bins=[0, 0.5, 60, max(data)])
plt.show()
But each bar's width will match the corresponding interval, meaning, in this example, that the "0" case will be barely visible:
(Note that 60 falls in the last bin when the bins are given as this sequence, because every bin except the last one excludes its right edge; changing the sequence to [0, 0.5, 60.5, max(data)] would fix that.)
What you (probably) need is first to categorize your data and then plot a bar chart of the categories:
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
df = pd.DataFrame()
df['data'] = data
def find_cat(x):
    if x == 0:
        return "0"
    elif x > 60:
        return "> 60"
    elif x > 0:
        return "> 0 and <= 60"
df['category'] = df['data'].apply(find_cat)
df.groupby('category', as_index=False).count().plot.bar(x='category', y='data', rot=0, width=0.8)
plt.show()
Output:
Building off Tranbi's answer, you could specify the bin edges as detailed in the link they shared.
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -6] # your data here
df = pd.DataFrame()
df['data'] = data
bin_edges = [-5, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
bin_edges_offset = [x+0.000001 for x in bin_edges]
plt.figure()
plt.hist(df['data'], bins=bin_edges_offset)
plt.show()
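As an alternative to nudging the edges by a small offset, the three rules from the question can be written as right-closed intervals with pandas.cut and then drawn as equally wide bars. This is only a sketch: dropping negative values and the label texts are assumptions, since the question does not say how negatives should be handled.

```python
import numpy as np
import pandas as pd

data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5]  # your data here

# Assumption: negative values are not wanted, so they are dropped first.
values = pd.Series(data)
values = values[values >= 0]

# Right-closed intervals: (-1, 0], (0, 5], (5, 10], ..., (55, 60], (60, inf]
edges = [-1] + list(range(0, 65, 5)) + [np.inf]
labels = ['0'] + [f'({lo}, {lo + 5}]' for lo in range(0, 60, 5)] + ['> 60']

binned = pd.cut(values, bins=edges, labels=labels, right=True)
# sort=False keeps the bins in their natural order instead of sorting by count.
counts = binned.value_counts(sort=False)
print(counts)

# counts.plot.bar(rot=45, width=1.0) would then draw one bar per bin.
```

Because every bin becomes a category, the "0" and "> 60" bars get the same width as the 5-wide bins in between.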
IIUC you want a classic histogram for values between 0 (not included) and 60 (included), plus two extra bins on the sides for 0 and for values > 60.
In that case I would recommend plotting the 3 regions separately:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
fig, axes = plt.subplots(1,3, sharey=True, width_ratios=[1, 12, 1])
fig.subplots_adjust(wspace=0)
# counting 0 values and drawing a bar between -5 and 0
axes[0].bar(-5, data.count(0), width=5, align='edge')
axes[0].xaxis.set_visible(False)
axes[0].spines['right'].set_visible(False)
axes[0].set_xlim((-5, 0))
# histogram between (0, 60]
axes[1].hist(data, bins=12, range=(0.0001, 60.0001))
axes[1].yaxis.set_visible(False)
axes[1].spines['left'].set_visible(False)
axes[1].spines['right'].set_visible(False)
axes[1].set_xlim((0, 60))
# counting values > 60 and drawing a bar between 60 and 65
axes[2].bar(60, len([x for x in data if x > 60]), width=5, align='edge')
axes[2].xaxis.set_visible(False)
axes[2].yaxis.set_visible(False)
axes[2].spines['left'].set_visible(False)
axes[2].set_xlim((60, 65))
plt.show()
Output:
Edit: If you want to plot a probability density, I would edit the data and simply use hist:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
data2 = []
for el in data:
    if el < 0:
        pass
    elif el > 60:
        data2.append(61)
    else:
        data2.append(el)
plt.hist(data2, bins=14, density=True, range=(-4.99,65.01))
plt.show()

changing the colour based on the graphs positions matplotlib

With the dataset below I am trying to plot a line graph with matplotlib. I am trying to write a function that compares each value with the previous one. If the current value is larger, it should draw a blue line to that point; for example, a blue line between (1, 100) and (2, 9313). If it is not larger, as between (6, 198913) and (7, 153054), a red line should be drawn.
import matplotlib.pyplot as plt
x_long = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
L_Amount_list = [100.00, 9313.38, 43601.28, 61701.69, 74331.88, 198913.81, 153054.54, 119162.10, 74382.25, 203542.82, 160774.71, 220307.19, 366459.26]
plt.plot(x_long,L_Amount_list, color = 'green')
First, build a list of line colors by comparing each value with the next one. Then, instead of drawing the whole line at once, plot the data two points at a time and give each segment its color from the list.
import matplotlib.pyplot as plt
x_long = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
L_Amount_list = [100.00, 9313.38, 43601.28, 61701.69, 74331.88, 198913.81, 153054.54, 119162.10, 74382.25, 203542.82, 160774.71, 220307.19, 366459.26]
colors = ['b' if a < b else 'r' for a,b in zip(L_Amount_list,L_Amount_list[1:])]
# Draw each pair of neighbouring points as its own coloured segment.
for i, color in enumerate(colors):
    plt.plot(x_long[i:i+2], L_Amount_list[i:i+2], color=color)
plt.show()
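With many points, one plt.plot call per segment can get slow. A possible variation, sketched here under the assumption that a single artist is acceptable, is to build all segments up front and hand them to matplotlib's LineCollection. The helper name draw_segments is illustrative only.

```python
x_long = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
L_Amount_list = [100.00, 9313.38, 43601.28, 61701.69, 74331.88, 198913.81,
                 153054.54, 119162.10, 74382.25, 203542.82, 160774.71,
                 220307.19, 366459.26]

points = list(zip(x_long, L_Amount_list))
# Each segment is a pair of neighbouring (x, y) points.
segments = [list(pair) for pair in zip(points, points[1:])]
# Blue when the value rises, red otherwise (same rule as in the answer above).
colors = ['b' if end[1] > start[1] else 'r' for start, end in segments]


def draw_segments(segments, colors):
    """Draw all segments as one LineCollection (hypothetical helper)."""
    import matplotlib.pyplot as plt
    from matplotlib.collections import LineCollection

    fig, ax = plt.subplots()
    ax.add_collection(LineCollection(segments, colors=colors))
    ax.autoscale()  # collections do not update the axis limits on their own
    return fig, ax
```

draw_segments(segments, colors) followed by plt.show() should look the same as the loop version.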

how to plot a single line with different types of line dash using bokeh?

I am trying to plot a line through a set of points. Currently, I have the points in a data frame with the columns X, Y and Type. Whenever the type is 1, I would like to draw the line dashed, and whenever the type is 2, I would like to draw it solid.
Currently, I am using a for loop to iterate over all points and plot each one using plt.dash. However, this slows down my run time, since I want to plot more than 40000 points.
So, is there an easy way to plot one line over all points with different line dash types?
You could do it by drawing multiple line segments like this (Bokeh v1.1.0):
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, Range1d, LinearAxis
line_style = {1: 'solid', 2: 'dashed'}
data = {'name': [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
'counter': [1, 2, 3, 3, 4, 5, 5, 6, 7, 8],
'score': [150, 150, 150, 150, 150, 150, 150, 150, 150, 150],
'age': [20, 21, 22, 22, 23, 24, 24, 25, 26, 27]}
df = pd.DataFrame(data)
plot = figure(y_range = (100, 200))
plot.extra_y_ranges = {"Age": Range1d(19, 28)}
plot.add_layout(LinearAxis(y_range_name = "Age"), 'right')
for i, g in df.groupby([(df.name != df.name.shift()).cumsum()]):
    source = ColumnDataSource(g)
    plot.line(x = 'counter', y = 'score', line_dash = line_style[g.name.unique()[0]], source = source)
    plot.circle(x = 'counter', y = 'age', color = "blue", size = 10, y_range_name = "Age", source = source)
show(plot)
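The key in the groupby above is what splits the data into runs: comparing each 'name' with the previous row marks where the type changes, and the cumulative sum turns those marks into a run number. A small pandas-only sketch of that step, using the same sample values (the variable names run_id and dash_per_run are illustrative):

```python
import pandas as pd

line_style = {1: 'solid', 2: 'dashed'}
df = pd.DataFrame({'name': [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
                   'counter': [1, 2, 3, 3, 4, 5, 5, 6, 7, 8]})

# True wherever the type differs from the previous row (the first row counts
# as a change because it is compared with NaN).
changed = df['name'] != df['name'].shift()
# Summing the change marks gives every consecutive run its own number.
run_id = changed.cumsum()

# One dash style per run, taken from the first row of each run.
dash_per_run = df.groupby(run_id)['name'].first().map(line_style)
print(run_id.tolist())
print(dash_per_run.tolist())
```

Each run then needs only one plot.line call, so the number of glyphs depends on how often the type changes rather than on the number of points.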

Get position of date on the x axis

I have set the following xlim on my x axis:
axA.set_xlim(datetime.date(2016, 12, 1), datetime.date(2018, 1, 30))
and now I would like to get the position of the 12th of October (2017-10-12) on my X axis, so that I can then put an annotation there.
I tried to figure that out using date2num and datestr2num:
release_date = datetime.datetime(2017, 10, 12)
print(mdates.date2num(release_date))
print(mdates.datestr2num('2017-10-12'))
print(axA.get_xlim())
The above code output:
-736614.0
736614.0
(17136.0, 17561.0)
First it seems like date2num and datestr2num don't give an identical result, but more problematically, those results are not within the range of xlim.
How can I find the X position of a date (to place an annotation), given the xlim I set above?
Code to reproduce the problem:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
def get_dataframe():
    values = [12, 16, 20]
    dates = [
        datetime(2017, 12, 24),
        datetime(2017, 12, 23),
        datetime(2017, 12, 22)
    ]
    df = pd.DataFrame(data={'date': dates, 'value': values})
    df = df.set_index(['date']).sort_index()
    return df
def plot(dataA):
    fig, axA = plt.subplots()
    dataA.plot(ax=axA)
    axA.set_xlim(datetime(2016, 12, 1), datetime(2018, 1, 30))
    release = datetime(2017, 10, 12)
    print(mdates.date2num(release))
    print(mdates.datestr2num('2017-10-12'))
    print(axA.get_xlim())
df = get_dataframe()
plot(df)
plt.show()
You can use a date object directly if you have a date xaxis:
ax.annotate('hello', xy=(datetime.datetime(2017, 10, 12), 1),
            xytext=(datetime.datetime(2017, 10, 12), 5),
            arrowprops={'facecolor': 'r'})
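As for the mismatch in the question: the printed xlim values (17136.0, 17561.0) are equal to the number of days between 1970-01-01 and the two limit dates, which is how pandas counts daily periods. That pandas used this unit for the plot is an inference from the numbers, not something stated in the thread. Under that assumption, the numeric x position can be computed like this:

```python
import pandas as pd

# Daily period ordinals count days since 1970-01-01.
lower = pd.Period('2016-12-01', freq='D').ordinal
upper = pd.Period('2018-01-30', freq='D').ordinal
release_x = pd.Period(pd.Timestamp('2017-10-12'), freq='D').ordinal

print(lower, upper)   # the xlim from the question
print(release_x)      # lies between the two limits
```

If passing a date object to annotate does not work in a given setup, release_x could be tried as the x coordinate instead.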

How can I create a seaborn regression plot with multiindex dataframe?

I have time series data which are multi-indexed on (Year, Month) as seen here:
print(df.index)
print(df)
MultiIndex(levels=[[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0], [2, 3, 4, 5, 6, 7, 8, 9]],
           names=['Year', 'Month'])
                Value
Year Month
2016 3      65.018150
     4      63.130035
     5      71.071254
     6      72.127967
     7      67.357795
     8      66.639228
     9      64.815232
     10     68.387698
I want to do very basic linear regression on these time series data. Because pandas.DataFrame.plot does not do any regression, I intend to use Seaborn to do my plotting.
I attempted to do this by using lmplot:
sns.lmplot(x=("Year", "Month"), y="Value", data=df, fit_reg=True)
but I get an error:
TypeError: '>' not supported between instances of 'str' and 'tuple'
This is particularly interesting to me because all elements in df.index.levels[:] are of type numpy.int64, all elements in df.index.labels[:] are of type numpy.int8.
Why am I receiving this error? How can I resolve it?
You can use reset_index to turn the dataframe's index into columns. Plotting DataFrame columns is then straightforward with seaborn.
As I guess the reason to use lmplot would be to show different regressions for different years (otherwise a regplot may be better suited), the "Year" column can be used as hue.
import numpy as np
import pandas as pd
import seaborn as sns  # seaborn.apionly is not available in current seaborn releases
import matplotlib.pyplot as plt
iterables = [[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
index = pd.MultiIndex.from_product(iterables, names=['Year', 'Month'])
df = pd.DataFrame({"values":np.random.rand(24)}, index=index)
df2 = df.reset_index() # or, df.reset_index(inplace=True) if df is not required otherwise
g = sns.lmplot(x="Month", y="values", data=df2, hue="Year")
plt.show()
Consider the following approach:
df['x'] = df.index.get_level_values(0) + df.index.get_level_values(1)/100
yields:
In [49]: df
Out[49]:
                Value        x
Year Month
2016 3      65.018150  2016.03
     4      63.130035  2016.04
     5      71.071254  2016.05
     6      72.127967  2016.06
     7      67.357795  2016.07
     8      66.639228  2016.08
     9      64.815232  2016.09
     10     68.387698  2016.10
Let's prepare the x-tick labels:
labels = df.index.get_level_values(0).astype(str) + '-' + \
         df.index.get_level_values(1).astype(str).str.zfill(2)
sns.lmplot(x='x', y='Value', data=df, fit_reg=True)
ax = plt.gca()
ax.set_xticks(df['x'])  # one tick per data point, so the labels line up
ax.set_xticklabels(labels)
Result:
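One caveat with Year + Month/100 is that the step from December to January (for example 2016.12 to 2017.01) is much larger than the step between other months, which would distort a regression over more than one year. A possible variation, sketched here with the eight values from the question, is an evenly spaced month counter; the column name t is an assumption, and np.polyfit is used only to show the fitted line numerically.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2016], range(3, 11)],
                                   names=['Year', 'Month'])
values = [65.018150, 63.130035, 71.071254, 72.127967,
          67.357795, 66.639228, 64.815232, 68.387698]
df = pd.DataFrame({'Value': values}, index=index)

flat = df.reset_index()
# Months since the first observation: evenly spaced across year boundaries.
months = flat['Year'] * 12 + flat['Month']
flat['t'] = months - months.min()

# Straight-line fit, the same kind of model that lmplot/regplot draw.
slope, intercept = np.polyfit(flat['t'], flat['Value'], deg=1)
print(slope, intercept)

# sns.regplot(x='t', y='Value', data=flat) would plot the same relationship.
```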
