Box Whisker plot of date frequency - python-3.x

Good morning all!
I have a Pandas DataFrame and I'm trying to create a monthly box and whisker plot of 30 years of data.
DataFrame
datetime year month day hour lon lat
0 3/18/1986 10:17 1986 3 18 10 -124.835 46.540
1 6/7/1986 13:38 1986 6 7 13 -121.669 46.376
2 7/17/1986 20:56 1986 7 17 20 -122.436 48.044
3 7/26/1986 2:46 1986 7 26 2 -123.071 48.731
4 8/2/1986 19:54 1986 8 2 19 -123.654 48.480
I'm trying to see the mean number of occurrences in a given month, the median, and the max/min occurrences (and the dates of the max and min).
I've been playing around with pandas.DataFrame.groupby() but don't fully understand it.
I have grouped the data by month and day to count occurrences, and I like this format:
Code:
df = pd.read_csv(masterCSVPath)
months = df['month']
test = df.groupby(['month','day'])['day'].count()
output: ---->
month day
1 1 50
2 103
3 97
4 29
5 60
...
12 27 24
28 7
29 17
30 18
31 9
So how can I turn that grouped output above into a box/whisker plot?
I want the x-axis to be months and the y-axis to be occurrences.

Try this (without doing groupby):
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x = 'month', y = 'day', data = df)
In case you want the months in 'Jan', 'Feb' format, try this:
import matplotlib.pyplot as plt
import seaborn as sns
# the datetime column must be an actual datetime dtype before using the .dt accessor
df['datetime'] = pd.to_datetime(df['datetime'])
df['month_new'] = df['datetime'].dt.strftime('%b')
sns.boxplot(x = 'month_new', y = 'day', data = df)
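Note that both snippets above put the day-of-month values on the y-axis. If the y-axis should instead be the number of occurrences (as in the groupby count from the question), a rough sketch is to reset the grouped counts and box-plot those; the 'occurrences' column name below is just for illustration:
import seaborn as sns
# count occurrences per (month, day), then box-plot the per-day counts for each month
counts = df.groupby(['month', 'day'])['day'].count().rename('occurrences').reset_index()
sns.boxplot(x='month', y='occurrences', data=counts)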

Related

Python - Plot Multiple Dataframes by Month and Day (Ignore Year)

I have multiple dataframes containing data from different years. The data in the dataframes looks like this:
>>> its[0].head(5)
Crocs
date
2017-01-01 46
2017-01-08 45
2017-01-15 43
2017-01-22 43
2017-01-29 41
>>> its[1].head(5)
Crocs
date
2018-01-07 23
2018-01-14 21
2018-01-21 23
2018-01-28 21
2018-02-04 25
>>> its[2].head(5)
Crocs
date
2019-01-06 90
2019-01-13 79
2019-01-20 82
2019-01-27 82
2019-02-03 81
I tried to plot all these dataframes in a single figure, and I managed to, but it was not what I wanted.
I plotted the dataframes using the following code
>>> for p in its:
...     plt.plot(p.index, p.values)
>>> plt.show()
and I got a graph, but it is not what I wanted. Simply put, I want the graph to ignore the years and plot by month and day.
You can try converting the datetime index to integers based on month and day and plotting against those:
df3 = pd.concat(its, axis=1)
# approximate position within the year, ignoring the year itself
xindex = df3.index.month * 30 + df3.index.day
plt.plot(xindex, df3)
plt.show()
If you want date-like labels rather than integers, you can add xticks to the figure:
labels = df3.index.month.astype(str) + "-" + df3.index.day.astype(str)
plt.xticks(df3.index.month * 30 + df3.index.day, labels)
plt.show()
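An alternative sketch, assuming each dataframe has a DatetimeIndex as in the samples above, is to re-anchor every timestamp to a single dummy year so matplotlib keeps real month spacing and can label the axis with month names (the year 2000 below is an arbitrary leap year):
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots()
for p in its:
    # map every timestamp to the same dummy year so the lines overlap by month/day
    common_index = [d.replace(year=2000) for d in p.index]
    ax.plot(common_index, p.values, label=str(p.index[0].year))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.legend()
plt.show()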

Plot values for multiple months and years in Plotly/Dash

I have a Dash dashboard and I need to plot months 0-12 on the x-axis, with multiple lines on the same figure for the different years that have been selected, i.e. 1991-2040. The plotted value is a column, say 'Total', in a dataframe. The labels should be the years and the Total value is on the y-axis. My data looks like this:
Month Year Total
0 0 1991 31.4
1 0 1992 31.4
2 0 1993 31.4
3 0 1994 20
4 0 1995 300
.. ... ... ...
33 0 2024 31.4
34 1 2035 567
35 1 2035 10
36 1 2035 3
....
Do I need to group it, and how do I achieve that in Dash/Plotly?
It seems to me that you should have a look at pd.pivot_table.
%matplotlib inline
import pandas as pd
import numpy as np
import plotly.offline as py
import plotly.graph_objs as go
# create a df
N = 100
df = pd.DataFrame({"Date":pd.date_range(start='1991-01-01',
periods=N,
freq='M'),
"Total":np.random.randn(N)})
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
# use pivot_table to have years as columns
pv = pd.pivot_table(df,
                    index=["Month"],
                    columns=["Year"],
                    values=["Total"])
# remove multiindex in columns
pv.columns = [col[1] for col in pv.columns]
data = [go.Scatter(x=pv.index,
                   y=pv[col],
                   name=col)
        for col in pv.columns]
py.iplot(data)
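For a Dash app, a shorter sketch (assuming plotly.express is available and the dataframe already has Month, Year and Total columns as in the question) is to skip the pivot and let the color argument split the lines by year:
import plotly.express as px
# one line per year, months on the x-axis; the resulting figure can be
# returned from a Dash callback into a dcc.Graph component
fig = px.line(df, x="Month", y="Total", color="Year")
fig.show()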

Plotting line graph on the same figure using matplotlib [duplicate]

I have a temperature file with many years of temperature records, in a format as below:
2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
Every year has a different number of records taken at different times, so the pandas DatetimeIndexes are all different.
I want to plot each year's data in the same figure for comparison; the x-axis should run from Jan to Dec and the y-axis should be temperature. How should I go about doing this?
Try:
ax = df1.plot()
df2.plot(ax=ax)
If you are running a Jupyter/IPython notebook and having problems using
ax = df1.plot()
df2.plot(ax=ax)
run both commands inside the same cell; for some reason it doesn't work when they are separated into sequential cells.
Chang's answer shows how to plot a different DataFrame on the same axes.
In this case, all of the data is in the same dataframe, so it's better to use groupby and unstack.
Alternatively, pandas.DataFrame.pivot_table can be used.
dfp = df.pivot_table(index='Month', columns='Year', values='value', aggfunc='mean')
When using pandas.read_csv, names= creates column headers when there are none in the file. The 'date' column must be parsed into datetime64[ns] Dtype so the .dt extractor can be used to extract the month and year.
import pandas as pd
# given the data in a file as shown in the op
df = pd.read_csv('temp.csv', names=['date', 'time', 'value'], parse_dates=['date'])
# create additional month and year columns for convenience
df['Year'] = df.date.dt.year
df['Month'] = df.date.dt.month
# group by month and year and aggregate the mean of the value column
dfg = df.groupby(['Month', 'Year'])['value'].mean().unstack()
# display(dfg)
Year 2005 2006 2007 2012
Month
4 NaN 20.6 NaN 20.7
5 NaN NaN 15.533333 NaN
8 19.566667 NaN NaN NaN
Now it's easy to plot each year as a separate line. The OP's sample only has data for one month per year, so only a single marker is displayed for each line.
ax = dfg.plot(figsize=(9, 7), marker='.', xticks=dfg.index)
To do this for multiple dataframes, you can do a for loop over them:
fig = plt.figure(num=None, figsize=(10, 8))
ax = dict_of_dfs['FOO'].column.plot()
for BAR in dict_of_dfs.keys():
    if BAR == 'FOO':
        pass
    else:
        dict_of_dfs[BAR].column.plot(ax=ax)
This can also be implemented without the if condition:
fig, ax = plt.subplots()
for BAR in dict_of_dfs.keys():
    dict_of_dfs[BAR].plot(ax=ax)
You can make use of the hue parameter in seaborn. For example:
import seaborn as sns
df = sns.load_dataset('flights')
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
.. ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432
sns.lineplot(x='month', y='passengers', hue='year', data=df)

How to combine multiple columns in a Data Frame to Pandas datetime format

I have a pandas data frame with values as below
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
I would like to know if there is any way I can combine the Date, Month, Year and Time columns into a pandas datetime format.
Use to_datetime, which automatically converts Day, Month and Year columns, and add the times converted with to_timedelta:
df['Datetime'] = pd.to_datetime(df.rename(columns={'Date': 'Day'})[['Day', 'Month', 'Year']]) + \
                 pd.to_timedelta(df['Time'])
Another solution is to join all the columns converted to strings first:
df['Datetime'] = pd.to_datetime(df[['Date', 'Month', 'Year', 'Time']]
                                  .astype(str).apply(' '.join, 1), format='%d %m %Y %H:%M:%S')

df['Datetime'] = (pd.to_datetime(df['Year'].astype(str) + '-' +
                                 df['Month'].astype(str) + '-' +
                                 df['Date'].astype(str) + ' ' +
                                 df['Time']))
print (df)
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Lastly, if you need to remove these columns:
df = df.drop(['Date','Month','Year', 'Time'], axis=1)
print (df)
ProcessID1 UserID Datetime
0 248 Tony 2017-04-29 23:30:56
1 436 Jeff 2017-04-28 20:02:19
2 500 Greg 2017-05-04 11:48:29
Concatenate the columns together to a string format and use pd.to_datetime to convert to datetime.
import pandas as pd
import io
txt = """
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
"""
df = pd.read_csv(io.StringIO(txt), sep="[\t ,]+")
df['Datetime'] = pd.to_datetime(df['Date'].astype(str)
                                + '-' + df['Month'].astype(str)
                                + '-' + df['Year'].astype(str)
                                + ' ' + df['Time'],
                                format='%d-%m-%Y %H:%M:%S')
df
You can also do this by using the apply() method:
import pandas as pd

df['Datetime'] = df[['Year', 'Month', 'Date']].astype(str).apply('-'.join, 1) + ' ' + df['Time']
Finally, convert 'Datetime' to datetime dtype by using the pandas to_datetime() method:
df['Datetime'] = pd.to_datetime(df['Datetime'])
Output of df:
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Now, if you want to remove the 'Date', 'Month', 'Year' and 'Time' columns, use:
df=df.drop(columns=['Date','Month','Year', 'Time'])

Determining the number of unique entries left after experiencing a specific item in pandas

I have a data frame with three columns: timestamp, lecture_id, and userid.
I am trying to write a loop that will count up the number of students who dropped (never seen again) after experiencing a specific lecture. The goal is to ultimately have a fourth column that shows the number of students remaining after exposure to a specific lecture.
I'm having trouble writing this in Python; I tried a for loop, which never finished (I have 13m rows).
import pandas as pd
import numpy as np
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101',periods=100, freq = 'H'))
dft = pd.DataFrame(
    {'lecture_id': ids,
     'userid': users,
     'timestamp': dates
     })
I want to make a new data frame that shows, for every user that experienced lecture x, how many never came back (dropped).
Not sure if this is exactly what you want, and there may be a simpler way, but this could be one way to do it:
import pandas as pd
import numpy as np

np.random.seed(42)
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101', periods=100, freq='H'))
df = pd.DataFrame({'lecture_id': ids, 'userid': users, 'timestamp': dates})

# row index of each user's last appearance
last_seen = df.groupby('userid')['timestamp'].idxmax()

# start with the total number of unique users, then subtract one each time
# a user is seen for the last time
df['remaining'] = df['userid'].nunique()
tmp = np.zeros(len(df))
tmp[last_seen] = 1
df['remaining'] = (df['remaining'] - tmp.cumsum()).astype(int)
df[-10:]
where the last 10 entries are:
lecture_id timestamp userid remaining
90 2 2013-01-04 18:00:00 9 6
91 0 2013-01-04 19:00:00 5 6
92 2 2013-01-04 20:00:00 6 6
93 2 2013-01-04 21:00:00 3 5
94 0 2013-01-04 22:00:00 6 4
95 2 2013-01-04 23:00:00 7 4
96 4 2013-01-05 00:00:00 0 3
97 1 2013-01-05 01:00:00 5 2
98 1 2013-01-05 02:00:00 7 1
99 0 2013-01-05 03:00:00 4 0
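The question also asks how many users dropped after each specific lecture. A minimal sketch building on the same idea, assuming "dropped after lecture x" means the user's last recorded row belongs to lecture x (dropped_per_lecture is just an illustrative name):
# row of each user's final appearance
last_rows = df.loc[df.groupby('userid')['timestamp'].idxmax()]
# number of distinct users whose last recorded lecture was each lecture_id
dropped_per_lecture = last_rows.groupby('lecture_id')['userid'].nunique()
print(dropped_per_lecture)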
