Plotting datetimes in matplotlib producing many colors - python-3.x

I am new to Python and trying to plot datetime data in matplotlib, but I am getting a strange result: I can only plot points, and they are a myriad of different colors. I am using plot_date().
I tried to build a minimal working example, but the problem doesn't show up there (see below). So here is a sample of the data that causes the problem.
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
#get a sense of what the data looks like:
data.head()
out:
date variable value unit
0 2020-04-17 10:30:02.309433 Temperature 20.799999 C
2 2020-04-17 10:45:12.089008 Temperature 20.799999 C
4 2020-04-17 11:00:07.033692 Temperature 20.799999 C
6 2020-04-17 11:15:04.457991 Temperature 20.799999 C
8 2020-04-17 11:30:04.996910 Temperature 20.799999 C
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 0 to 196
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 99 non-null object
1 variable 99 non-null object
2 value 98 non-null float64
3 unit 99 non-null object
dtypes: float64(1), object(3)
memory usage: 3.9+ KB
#convert date variable to datetime
data['date'] = pd.to_datetime(data['date'])
#plot with plot_date, calling date2num on date variable
plt.plot_date([mdates.date2num(data['date'])], [data['value']])
Gives:
Why am I getting all these colored points? When I build a small data set of three time periods I don't see this behavior. Instead I get three blue points:
#create dataframe
df = pd.DataFrame({'time': ['2020-04-17 10:30:02.309433', '2020-04-17 10:30:02.309455', '2020-04-17 10:45:12.089008'],
'value': [20.799999, 41.099998, 47.599998]})
#change time variable to datetime object
df['time'] = pd.to_datetime(df['time'])
#plot
plt.plot_date(mdates.date2num(df['time']), df['value'])
Gives three blue dots as expected:
Finally, how can I produce a line plot using plot_date()? The only way I have seen to do this is with datetime.datetime.now() date formats and a call to pyplot.plot(); see the second answer here: Plotting time in Python with Matplotlib

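The difference between plt.plot_date([mdates.date2num(data['date'])], [data['value']]) and plt.plot_date(mdates.date2num(df['time']), df['value']) is the extra set of square brackets in the first call. Wrapping each argument in a list turns it into a two-dimensional array with a single row, and matplotlib plots every column of a 2-D input as a separate series, so each point gets its own color from the color cycle.
For a line, drop the extra brackets and pass fmt='-' to plot_date.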
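To see the effect of the extra brackets in isolation, here is a minimal sketch. It assumes a three-row frame similar to the one in the question, uses the non-interactive Agg backend so it runs without a display, and calls Axes.plot on date2num values rather than plot_date (which newer matplotlib releases discourage). All variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-04-17 10:30:02", "2020-04-17 10:45:12", "2020-04-17 11:00:07",
    ]),
    "value": [20.8, 41.1, 47.6],
})

x = mdates.date2num(df["time"].to_numpy())  # 1-D array of float day numbers
y = df["value"].to_numpy()

fig, (ax_nested, ax_flat) = plt.subplots(1, 2, figsize=(8, 3))

# Extra brackets: the inputs have shape (1, 3), so every column is its own series
nested_lines = ax_nested.plot([x], [y], "o")

# No extra brackets: the inputs have shape (3,), one series drawn as a single line
flat_lines = ax_flat.plot(x, y, "-")

# Show the numeric x values as times again
for ax in (ax_nested, ax_flat):
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))

print(len(nested_lines), len(flat_lines))
```

The number of Line2D objects returned by each call shows how many separate series matplotlib created.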

Related

how to set datetime type index for weekly column in pandas dataframe

I have a data as given below:
date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50
This data is saved in test.txt file.
Date column is given as a weekly column as a concatenation of year and weekid. I am trying to set the date column as an index, with given code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
But it gives an error. How can I set the date column as an index with datetime type?
Use index_col parameter for setting index:
data=pd.read_csv("test.txt", sep="\t", index_col=[0])
EDIT: Using column name as index:
data=pd.read_csv("test.txt", sep="\t", index_col=['date'])
To convert the index from int to datetime, do this (note that '%Y%m' reads the last two digits as a month, not as a week number):
data.index = pd.to_datetime(data.index, format='%Y%m')
There may be simpler solutions than this. I first used apply to convert the Year-Weekid values into Year-month-day dates and then used set_index to make date the index column.
import pandas as pd
data ={
'date' : [201901,201902,201903,201904,201905],
'product' : ['A','A','A','C','C'],
'price' : [10,10,10,20,20],
'amount' : [20,20,30,50,60]
}
df = pd.DataFrame(data)
# str(x)+'1' appends a weekday to give Year-WeekId-Weekday; with %w, 1 means Monday,
# so '2019021' means 2019, week 2, Monday.
# You can try other formats too.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
df.set_index(['date'],inplace=True)
df
Edit:
To display the datetime in Year-WeekID format you can style the dataframe as follows. Note that this code does not work if date has already been set as the index column. Also remember that it only applies styling for display purposes; internally the values remain datetime objects.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
style_format = {'date':'{:%Y%W}'}
df.style.format(style_format)
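Which Monday a week number maps to depends on the week convention, so it is worth checking before building an index. The sketch below uses only the standard library and compares the '%Y%W%w' format used above with the ISO 8601 directives '%G%V%u' (available in Python 3.6+). week_start is a hypothetical helper, and whether the file uses ISO weeks is an assumption you would need to confirm for your data.

```python
from datetime import datetime


def week_start(yyyyww, iso=False):
    """Return the Monday of a year-week code such as 201901 (hypothetical helper)."""
    text = str(yyyyww) + "1"  # append weekday 1 = Monday in both conventions
    fmt = "%G%V%u" if iso else "%Y%W%w"
    return datetime.strptime(text, fmt)


# %W: week 1 starts on the first Monday of the calendar year
monday_w = week_start(201901)

# ISO 8601: week 1 is the week that contains 4 January
monday_iso = week_start(201901, iso=True)

print(monday_w.date(), monday_iso.date())
```

For 2019 the two conventions differ by a full week, so the choice changes every date in the index.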
You also can use the date_parser parameter:
import pandas as pd
from io import StringIO
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y%m')
inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")
df = pd.read_csv(inputtxt, sep=r'\s+', parse_dates=['date'], date_parser=dateparse)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
3 amount 4 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 256.0+ bytes
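Newer pandas releases deprecate date_parser, so an alternative that does not depend on it is to read the column as text and convert it afterwards. The sketch below assumes the same whitespace-separated sample and, as an assumption about the data, treats the last two digits as a %W week number anchored on Monday; use format='%Y%m' on the unmodified column instead if they are really months.

```python
import pandas as pd
from io import StringIO

inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")

# Read the date column as plain text so that nothing is parsed implicitly
df = pd.read_csv(inputtxt, sep=r"\s+", dtype={"date": str})

# Append weekday 1 (Monday) and parse year + week number + weekday
df["date"] = pd.to_datetime(df["date"] + "1", format="%Y%W%w")

# Use the parsed dates as the index
df = df.set_index("date")
print(df.index)
```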

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the column labels after reading the Excel file. There are 10 columns in df after reading the file, and 7 of the column labels are dates.
My input data set is an Excel file that changes every day, and I want to update it automatically. In Excel, some column labels are dates like 11-Nov-2020, 12-Nov-2020, but after reading the file they become 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep the column labels as 11-Nov-2020, 12-Nov-2020 while reading the file with pd.read_excel if possible; otherwise I need to convert them afterwards.
I am very new to Python. Looking forward to your support.
Thanks to those who have already come forward to help me.
You can of course use the standard Python methods to parse the date values, but I would not recommend it, because you end up with plain Python objects rather than the pandas representation of dates. That consumes more space, is probably less efficient, and you can't use the pandas methods to access, for example, the year. I'll show what I mean below.
If you want to avoid the naming issue with your column names, you might try to prevent pandas from assigning the names automatically, read the first line as data, and fix the names yourself (see the section below on how to do it).
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta
df= pd.DataFrame(dict(id=range(10), date_string=[str(datetime.now()+ timedelta(days=d)) for d in range(10)]))
# test the python way:
df['date_val_python']= df['date_string'].map(lambda dt: str(dt))
# use the pandas way (btw. if you want to explicitly
# specify the format, you can use the format= keyword)
df['date_val_pandas']= pd.to_datetime(df['date_string'])
df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see, date_val_python has dtype object because it holds plain Python objects (in this example, strings), while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
The column naming part:
# read your dataframe as usual
df = pd.read_excel('c:/scratch/tmp/dates.xlsx')
# collect a new label for every date-like column label
rename_dict = dict()
for old_name in df.columns:
    if hasattr(old_name, 'strftime'):
        # '%d-%b-%Y' produces labels such as 11-Nov-2020
        new_name = old_name.strftime('%d-%b-%Y')
        rename_dict[old_name] = new_name
if len(rename_dict) > 0:
    df.rename(columns=rename_dict, inplace=True)
This works if your column titles are stored as ordinary dates, which I suppose is true, because you get a time part after importing them.
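The renaming idea can be tried without an Excel file. The sketch below is an illustration under assumptions: the frame and its values are made up to mimic what read_excel returns when some header cells are dates, and the month abbreviation produced by '%b' depends on the locale (English is assumed here).

```python
from datetime import datetime
import pandas as pd

# Hypothetical frame that mimics read_excel output with date headers
df = pd.DataFrame(
    [["RNC1", "P1", 3, 5, 2]],
    columns=[
        "IPNI_RNC",
        "PATHID",
        datetime(2020, 11, 11),
        datetime(2020, 11, 12),
        "Last Day",
    ],
)

# Format date-like labels and leave every other label untouched
df.columns = [
    col.strftime("%d-%b-%Y") if hasattr(col, "strftime") else col
    for col in df.columns
]
print(list(df.columns))
```

Assigning a new list to df.columns avoids building a rename dictionary, at the cost of replacing all labels in one step.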
strftime of the datetime module is the function you need:
If datetime is a datetime object, you can do
datetime.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
    if col in datetime_cols_list:
        dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas; this is just what came off the top of my head.
Example:
import pandas as pd
import numpy as np
np.random.seed(0)
# generate some random datetime values
rng = pd.date_range('2015-02-24', periods=5, freq='min')
# a second, independent datetime column (do not reassign rng here)
other_dt_col = pd.date_range('2016-02-24', periods=5, freq='min')
df = pd.DataFrame({ 'Date': rng, 'Date2': other_dt_col,'Val': np.random.randn(len(rng)) })
print (df)
# Output:
# Date Date2 Val
# 0 2015-02-24 00:00:00 2016-02-24 00:00:00 1.764052
# 1 2015-02-24 00:01:00 2016-02-24 00:01:00 0.400157
# 2 2015-02-24 00:02:00 2016-02-24 00:02:00 0.978738
# 3 2015-02-24 00:03:00 2016-02-24 00:03:00 2.240893
# 4 2015-02-24 00:04:00 2016-02-24 00:04:00 1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
    if col in datetime_cols_list:
        # format every value of the datetime column as DD-Mon-YYYY text
        df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
# Date Date2 Val
# 0 24-Feb-2015 24-Feb-2016 1.764052
# 1 24-Feb-2015 24-Feb-2016 0.400157
# 2 24-Feb-2015 24-Feb-2016 0.978738
# 3 24-Feb-2015 24-Feb-2016 2.240893
# 4 24-Feb-2015 24-Feb-2016 1.867558
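For columns that already have a datetime64 dtype, the same formatting can be done without apply by using the .dt accessor. This is a small sketch with made-up values; the column names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2015-02-24", periods=3, freq="min"),
    "Val": [1.0, 2.0, 3.0],
})

# .dt.strftime formats the whole column at once and returns strings
df["Date"] = df["Date"].dt.strftime("%d-%b-%Y")
print(df)
```

Note that the result is a column of strings, so datetime operations are no longer available on it afterwards.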

How to Successfully Produce Mosaic Plots in Pyviz Panel Apps?

I have created the following dataframe df:
Setup:
import pandas as pd
import numpy as np
import random
import copy
import feather
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
import plotly.graph_objects as go
import plotly.express as px
import panel as pn
import holoviews as hv
import geoviews as gv
import geoviews.feature as gf
import cartopy
import cartopy.feature as cf
from geoviews import opts
from cartopy import crs as ccrs
import hvplot.pandas
import colorcet as cc
from colorcet.plotting import swatch
#pn.extension() # commented out as this causes an intermittent javascript error
gv.extension("bokeh")
cols = {"name":["Jim","Alice","Bob","Julia","Fern","Bill","Jordan","Pip","Shelly","Mimi"],
"age":[19,26,37,45,56,71,20,36,37,55],
"sex":["Male","Female","Male","Female","Female","Male","Male","Male","Female","Female"],
"age_band":["18-24","25-34","35-44","45-54","55-64","65-74","18-24","35-44","35-44","55-64"],
"insurance_renew_month":[1,2,3,3,3,4,5,5,6,7],
"postcode_prefix":["EH","M","G","EH","EH","M","G","EH","M","EH"],
"postcode_order":[3,2,1,3,3,2,1,3,2,3],
"local_authority_district":["S12000036","E08000003","S12000049","S12000036","S12000036","E08000003","S12000036","E08000003","S12000049","S12000036"],
"blah1":[3,None,None,8,8,None,1,None,None,None],
"blah2":[None,None,None,33,5,None,66,3,22,3],
"blah3":["A",None,"A",None,"C",None,None,None,None,None],
"blah4":[None,None,None,None,None,None,None,None,None,1]}
df = pd.DataFrame.from_dict(cols)
df
Out[2]:
name age sex age_band ... blah1 blah2 blah3 blah4
0 Jim 19 Male 18-24 ... 3.0 NaN A NaN
1 Alice 26 Female 25-34 ... NaN NaN None NaN
2 Bob 37 Male 35-44 ... NaN NaN A NaN
3 Julia 45 Female 45-54 ... 8.0 33.0 None NaN
4 Fern 56 Female 55-64 ... 8.0 5.0 C NaN
5 Bill 71 Male 65-74 ... NaN NaN None NaN
6 Jordan 20 Male 18-24 ... 1.0 66.0 None NaN
7 Pip 36 Male 35-44 ... NaN 3.0 None NaN
8 Shelly 37 Female 35-44 ... NaN 22.0 None NaN
9 Mimi 55 Female 55-64 ... NaN 3.0 None 1.0
[10 rows x 12 columns]
df[["sex","age_band","postcode_prefix"]] = df[["sex","age_band","postcode_prefix"]].astype("category")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
name 10 non-null object
age 10 non-null int64
sex 10 non-null category
age_band 10 non-null category
insurance_renew_month 10 non-null int64
postcode_prefix 10 non-null category
postcode_order 10 non-null int64
local_authority_district 10 non-null object
blah1 4 non-null float64
blah2 6 non-null float64
blah3 3 non-null object
blah4 1 non-null float64
dtypes: category(3), float64(3), int64(3), object(3)
memory usage: 1.3+ KB
The Problem:
I can successfully create a mosaic plot with the following code:
fig,ax = plt.subplots(figsize=(15,10))
mosaic(df,["sex", "age_band"],ax=ax);
However, I am having issues when I try to create a corresponding app using pn.interact:
categoric_cols = df.select_dtypes(include="category")
cat_atts = categoric_cols.columns.tolist()
cat_atts
Out[4]: ['sex', 'age_band', 'postcode_prefix']
def bivar_cat(x="sex",y="age_band"):
    if x in cat_atts and y in cat_atts:
        fig,ax = plt.subplots(figsize=(15,10))
        return mosaic(df,[x,y],ax=ax);
app_df_cat = pn.interact(bivar_cat,x=cat_atts,y=cat_atts)
app_df_cat
Which results in the following:
The above rendered mosaic plot seems to correspond to the default values of x & y (ie sex & age_band). When you select a new attribute for x or y from the dropdowns, the text above the mosaic plot changes (this text seems to be a string representation of the plot) however the mosaic plot itself does not.
Is my issue possibly related to having to comment out pn.extension()? I have found that when pn.extension() is not commented out, it results in an intermittent javascript error whereby sometimes there is no error raised, sometimes there is an error but my panel app still loads and sometimes there is an error and it crashes my browser. (I have omitted the javascript error here as it can be very large - if it is helpful I can add this to my post.) I would say that the error is raised significantly more often than it is not.
Strangely enough, I haven't observed any difference in other apps that I have created where I have omitted pn.extension() vs including it.
However as the documentation always specifies that you include it, I would have expected that I would have to set my appropriate extensions for all my plots to work correctly? (I have plotly, hvplot, holoviews and geoviews plots successfully plotting in these other apps with and without pn.extension() and pn.extension("plotly") included).
Is it possible to produce panel apps based on mosaic plots?
Thanks
Software Info:
os x Catalina
browser Firefox
python 3.7.5
notebook 6.0.2
pandas 0.25.3
panel 0.7.0
plotly 4.3.0
plotly_express 0.4.1
holoviews 1.12.6
geoviews 1.6.5
hvplot 0.5.2
The statsmodels function mosaic() returns a tuple containing a figure and a dictionary of rectangles.
What you're seeing via interact is the text representation of that tuple. The tuple is updated by your code whenever you use the dropdowns.
The figure you see below it is the one that Jupyter automatically renders once. That one doesn't get updated.
The solution is two-fold:
1) only return the figure, not the tuple
2) prevent jupyter from automatically plotting your figure once with plt.close()
In code:
def bivar_cat(x='sex', y='age_band'):
    fig, ax = plt.subplots(figsize=(15,10))
    mosaic(df, [x,y], ax=ax)
    plt.close()
    return fig
app_df_cat = pn.interact(
bivar_cat,
x=cat_atts,
y=cat_atts,
)
app_df_cat
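The effect of closing and returning the figure can be checked without Panel or statsmodels. The sketch below is only an illustration under that assumption: make_figure is a hypothetical stand-in for bivar_cat that draws a bar chart instead of a mosaic, and the Agg backend is selected so it runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def make_figure(heights):
    """Hypothetical stand-in for bivar_cat: build, close, and return a figure."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(range(len(heights)), heights)
    plt.close(fig)  # pyplot stops tracking the figure, so it is not auto-displayed
    return fig      # the Figure object itself remains usable


fig = make_figure([3, 1, 2])
print(plt.get_fignums())  # figures still registered with pyplot

# The returned figure can still be rendered, which is what a dashboard pane needs
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
print(buffer.getbuffer().nbytes > 0)
```

Passing the figure explicitly to plt.close(fig) closes exactly that figure rather than whichever one happens to be current.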

Plotting line graph on the same figure using matplotlib [duplicate]

I have a temperature file with many years temperature records, in a format as below:
2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
Each year has a different number of records taken at different times, so the pandas DatetimeIndexes are all different.
I want to plot the different years' data in the same figure for comparison. The x-axis is Jan to Dec and the y-axis is temperature. How should I go about doing this?
Try:
ax = df1.plot()
df2.plot(ax=ax)
If you are running a Jupyter/IPython notebook and having problems using:
ax = df1.plot()
df2.plot(ax=ax)
run both commands inside the same cell. For me at least, it doesn't work when they are split across sequential cells.
Chang's answer shows how to plot a different DataFrame on the same axes.
In this case, all of the data is in the same dataframe, so it's better to use groupby and unstack.
Alternatively, pandas.DataFrame.pivot_table can be used.
dfp = df.pivot_table(index='Month', columns='Year', values='value', aggfunc='mean')
When using pandas.read_csv, names= creates column headers when there are none in the file. The 'date' column must be parsed into datetime64[ns] Dtype so the .dt extractor can be used to extract the month and year.
import pandas as pd
# given the data in a file as shown in the op
df = pd.read_csv('temp.csv', names=['date', 'time', 'value'], parse_dates=['date'])
# create additional month and year columns for convenience
df['Year'] = df.date.dt.year
df['Month'] = df.date.dt.month
# group by month and year and aggregate the mean of the value column
dfg = df.groupby(['Month', 'Year'])['value'].mean().unstack()
# display(dfg)
Year 2005 2006 2007 2012
Month
4 NaN 20.6 NaN 20.7
5 NaN NaN 15.533333 NaN
8 19.566667 NaN NaN NaN
Now it's easy to plot each year as a separate line. The OP only has one observation for each year, so only a marker is displayed.
ax = dfg.plot(figsize=(9, 7), marker='.', xticks=dfg.index)
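The whole pipeline can be run end to end on the sample rows from the question. The sketch below embeds the data with io.StringIO instead of reading temp.csv, uses the Agg backend so no display is needed, and also checks that the pivot_table variant gives the same table as groupby plus unstack; the axis labels are my own additions.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

raw = io.StringIO("""2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
""")

df = pd.read_csv(raw, names=["date", "time", "value"], parse_dates=["date"])
df["Year"] = df["date"].dt.year
df["Month"] = df["date"].dt.month

# Months as rows, years as columns, mean temperature as values
dfg = df.groupby(["Month", "Year"])["value"].mean().unstack()
dfp = df.pivot_table(index="Month", columns="Year", values="value", aggfunc="mean")

# Both approaches should produce the same numbers (NaN where a year has no data)
same = np.allclose(dfg.to_numpy(), dfp.to_numpy(), equal_nan=True)
print(same)

# One line (here a single marker) per year on shared axes
ax = dfg.plot(marker=".", xticks=dfg.index)
ax.set_xlabel("Month")
ax.set_ylabel("Temperature")
```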
To do this for multiple dataframes, you can do a for loop over them:
fig = plt.figure(num=None, figsize=(10, 8))
ax = dict_of_dfs['FOO'].column.plot()
for BAR in dict_of_dfs.keys():
    if BAR == 'FOO':
        pass
    else:
        dict_of_dfs[BAR].column.plot(ax=ax)
This can also be implemented without the if condition:
fig, ax = plt.subplots()
for BAR in dict_of_dfs.keys():
    dict_of_dfs[BAR].plot(ax=ax)
You can make use of the hue parameter in seaborn. For example:
import seaborn as sns
df = sns.load_dataset('flights')
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
.. ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432
sns.lineplot(x='month', y='passengers', hue='year', data=df)

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calorie intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I would therefore like to parse month, year, and day into a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) How do I parse the data and (2) define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here) - but I can't make it work so far.
You can use the parse_dates parameter of read_csv and pass the column names as a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or, better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
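As far as I know, recent pandas releases deprecate combining several columns through parse_dates, so an alternative is to read the file as is and assemble the dates afterwards. The sketch below assumes the same comma-separated layout (shortened to three rows) and relies on pd.to_datetime accepting a frame with columns named year, month and day.

```python
import io
import pandas as pd

temp = """cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3661.9414,2,2003,14"""

# After testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp))

# to_datetime assembles dates from columns named year, month and day
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Keep only the calories, indexed by date, as a time series
series = df.set_index("date")["cal"]
print(series)
```

The resulting Series has a DatetimeIndex, which answers both parts of the question: the parsing and the dtype of the new column.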
