Aggregate time series with group by and create chart with multiple series - python-3.x

I have time series data and I want to create a line chart of the monthly counts of records (months on the x-axis), grouped by sentiment (one line per sentiment).
The data looks like this:
created_at id polarity sentiment
0 Fri Nov 02 11:22:47 +0000 2018 1058318498663870464 0.000000 neutral
1 Fri Nov 02 11:20:54 +0000 2018 1058318026758598656 0.011905 neutral
2 Fri Nov 02 09:41:37 +0000 2018 1058293038739607552 0.800000 positive
3 Fri Nov 02 09:40:48 +0000 2018 1058292834699231233 0.800000 positive
4 Thu Nov 01 18:23:17 +0000 2018 1058061933243518976 0.233333 neutral
5 Thu Nov 01 17:50:39 +0000 2018 1058053723157618690 0.400000 positive
6 Wed Oct 31 18:57:53 +0000 2018 1057708251758903296 0.566667 positive
7 Sun Oct 28 17:21:24 +0000 2018 1056596810570100736 0.000000 neutral
8 Sun Oct 21 13:00:53 +0000 2018 1053994531845296128 0.136364 neutral
9 Sun Oct 21 12:55:12 +0000 2018 1053993101205868544 0.083333 neutral
So far I have managed to aggregate to the monthly totals, with the following code:
import pandas as pd
tweets = process_twitter_json(file_name)
#print(tweets[:10])
df = pd.DataFrame.from_records(tweets)
print(df.head(10))
#make the string date into a date field
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
#print('Monthly counts')
monthly_sentiment = df.groupby('sentiment')['tweet_datetime'].resample('M').count()
I'm struggling with how to chart the data.
Do I pivot to turn each of the discrete values within the sentiment field into separate columns?
I've tried .unstack(), which turns the sentiment values into rows; that is almost there, but the problem is that the dates become string column headers, which is no good for charting.

OK, I changed the monthly aggregation method and used Grouper instead of resample. This meant that when I did the unstack() the resulting dataframe was vertical (deep and narrow), with dates as rows rather than as column headers, so I no longer had issues with dates being stored as strings when I came to chart it.
Full code:
import pandas as pd
tweets = process_twitter_json(file_name)
df = pd.DataFrame.from_records(tweets)
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
grouper = df.groupby(['sentiment', pd.Grouper(key='tweet_datetime', freq='M')]).id.count()
result = grouper.unstack('sentiment').fillna(0)
##=================================================
##PLOTLY - charts in Jupyter
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
print(__version__)  # requires version >= 1.9.0
import plotly.graph_objs as go
init_notebook_mode(connected=True)
trace0 = go.Scatter(
    x=result.index,
    y=result['positive'],
    name='Positive',
    line=dict(color='rgb(205, 12, 24)', width=4)
)
trace1 = go.Scatter(
    x=result.index,
    y=result['negative'],
    name='Negative',
    line=dict(color='rgb(22, 96, 167)', width=4)
)
trace2 = go.Scatter(
    x=result.index,
    y=result['neutral'],
    name='Neutral',
    line=dict(color='rgb(12, 205, 24)', width=4)
)
data = [trace0, trace1, trace2]
iplot(data)
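As a possible alternative (a sketch, not part of the original answer): since result already has one column per sentiment and a DatetimeIndex, pandas can plot it directly without building plotly traces by hand.
import matplotlib.pyplot as plt

# result has a DatetimeIndex (month ends) and one column per sentiment
ax = result.plot(linewidth=2, figsize=(10, 5))
ax.set_xlabel('Month')
ax.set_ylabel('Tweet count')
plt.show()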

Related

plotly python Sankey Plot

I am trying to create a Sankey diagram in Python. The idea is to show the change in size of each Topic month on month. This is my pandas sample DataFrame. There are more Topics, and each Topic has more months and years, which makes the dataframe tall.
df
year month Topic Document_Size
0 2022 1 0.0 63
1 2022 1 1.0 120
2 2022 1 2.0 106
3 2022 2 0.0 70
4 2022 2 1.0 42
5 2022 2 2.0 45
6 2022 3 0.0 78
7 2022 3 1.0 14
8 2022 3 2.0 84
I have prepared the following from a plotly demo. I am missing the values that should go into the variables node_label, source_node and target_node so that the following code works; I am not getting the correct plot output.
node_label = ?
source_node = ?
target_node = ?
values = df['Document_Size']
from webcolors import hex_to_rgb
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objects as go # Import the graphical object
fig = go.Figure(
    data=[go.Sankey(  # The plot we are interested in
        # This part is for the node information
        node=dict(
            label=node_label
        ),
        # This part is for the link information
        link=dict(
            source=source_node,
            target=target_node,
            value=values
        ))])
# With this save the plot
plot(fig,
     image_filename='sankey_plot_1',
     image='png',
     image_width=5000,
     image_height=3000)
# And show the plot
fig.show()
Reusing this answer: sankey from dataframe.
Restructure the dataframe so that it has the structure used in that answer:
   Document_Size  source       target
   63             2022 01 0.0  2022 02 0.0
   120            2022 01 1.0  2022 02 1.0
   106            2022 01 2.0  2022 02 2.0
   70             2022 02 0.0  2022 03 0.0
   42             2022 02 1.0  2022 03 1.0
   45             2022 02 2.0  2022 03 2.0
import pandas as pd
import io
import numpy as np
import plotly.graph_objects as go
df = pd.read_csv(
io.StringIO(
""" year month Topic Document_Size
0 2022 1 0.0 63
1 2022 1 1.0 120
2 2022 1 2.0 106
3 2022 2 0.0 70
4 2022 2 1.0 42
5 2022 2 2.0 45
6 2022 3 0.0 78
7 2022 3 1.0 14
8 2022 3 2.0 84"""
),
sep="\s+",
)
# data for year and month
df["date"] = pd.to_datetime(df.assign(day=1).loc[:, ["year", "month", "day"]])
# index dataframe ready for constructing dataframe of source and target
df = df.drop(columns=["year", "month"]).set_index(["date", "Topic"])
dates = df.index.get_level_values(0).unique()
# for each pair of current date and next date, construct a segment of source / target data
df_sankey = pd.concat(
    [df.loc[s].assign(source=s, target=t) for s, t in zip(dates, dates[1:])]
)
df_sankey["source"] = df_sankey["source"].dt.strftime("%Y %m ") + df_sankey.index.astype(str)
df_sankey["target"] = df_sankey["target"].dt.strftime("%Y %m ") + df_sankey.index.astype(str)
nodes = np.unique(df_sankey[["source", "target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))
go.Figure(
    go.Sankey(
        node={"label": nodes.index},
        link={
            "source": nodes.loc[df_sankey["source"]],
            "target": nodes.loc[df_sankey["target"]],
            "value": df_sankey["Document_Size"],
        },
    )
)
output
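For reference, the objects built above can be mapped back to the variable names used in the question; this mapping is my own sketch, not part of the original answer.
# node labels and integer link endpoints in the form the question asked for
node_label = list(nodes.index)
source_node = nodes.loc[df_sankey["source"]].tolist()
target_node = nodes.loc[df_sankey["target"]].tolist()
values = df_sankey["Document_Size"].tolist()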
The Sankey chart can also be created with the D3Blocks library.
Install first:
pip install d3blocks
The data frame should be restructured in the form:
# source target weight
# 2022 01 0.0 2022 02 0.0 63
# 2022 01 1.0 2022 02 1.0 120
# etc
# Load d3blocks
from d3blocks import D3Blocks
# Initialize
d3 = D3Blocks()
# Load example data
df = d3.import_example('energy')
# Plot
d3.sankey(df, filepath='sankey.html')
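Assuming the df_sankey frame built in the previous answer, it could be fed to d3blocks after renaming the weight column. This is only a sketch following the source/target/weight layout shown above; the df_weights name and the output file name are my own choices.
from d3blocks import D3Blocks

d3 = D3Blocks()
# rename to the source/target/weight layout d3blocks expects
df_weights = df_sankey.rename(columns={'Document_Size': 'weight'})[['source', 'target', 'weight']]
d3.sankey(df_weights, filepath='sankey_topics.html')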

Box Whisker plot of date frequency

Good morning all!
I have a Pandas df and I'm trying to create a monthly box and whisker plot of 30 years of data.
DataFrame
datetime year month day hour lon lat
0 3/18/1986 10:17 1986 3 18 10 -124.835 46.540
1 6/7/1986 13:38 1986 6 7 13 -121.669 46.376
2 7/17/1986 20:56 1986 7 17 20 -122.436 48.044
3 7/26/1986 2:46 1986 7 26 2 -123.071 48.731
4 8/2/1986 19:54 1986 8 2 19 -123.654 48.480
I'm trying to see the mean number of occurrences in a given month, the median, and the max/min occurrence (and the dates of the max and min).
I've been playing around with pandas.DataFrame.groupby() but don't fully understand it.
I have grouped the data by month and day occurrences. I like this format:
Code:
df = pd.read_csv(masterCSVPath)
months = df['month']
test = df.groupby(['month','day'])['day'].count()
output: ---->
month day
1 1 50
2 103
3 97
4 29
5 60
...
12 27 24
28 7
29 17
30 18
31 9
So how can I turn that df above into a box/whisker plot?
I want the x-axis to be months and the y-axis to be occurrences.
Try this (without doing groupby):
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x = 'month', y = 'day', data = df)
In case you want the months to be in Jan, Feb format then try this:
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
df['datetime'] = pd.to_datetime(df['datetime'])  # needed if the column is still a string
df['month_new'] = df['datetime'].dt.strftime('%b')
sns.boxplot(x = 'month_new', y = 'day', data = df)
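Note that the boxplots above show the distribution of day-of-month values within each month. If the goal is the distribution of monthly occurrence counts across the 30 years, a sketch along these lines (my own reading of the intent, using the year and month columns from the sample) may be closer:
import seaborn as sns
import matplotlib.pyplot as plt

# count occurrences per (year, month), then box-plot those counts by month
counts = df.groupby(['year', 'month']).size().reset_index(name='occurrences')
sns.boxplot(x='month', y='occurrences', data=counts)
plt.show()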

Convert column values into rows in the order in which columns are present

Below is a sample dataframe I have. I need to convert each row into multiple rows based on month.
df = pd.DataFrame({'Jan': [100,200,300],
                   'Feb': [400,500,600],
                   'March': [700,800,900],
                   })
Desired output :
Jan 100
Feb 400
March 700
Jan 200
Feb 500
March 800
Jan 300
Feb 600
March 900
I tried using the pandas melt function, but it groups Jan together, then Feb, then March: 3 rows for Jan, then 3 for Feb, and the same for March. But I want to achieve the above output. Could someone please help?
Use DataFrame.stack with some data cleaning by Series.reset_index with Series.rename_axis:
df1 = (df.stack()
         .reset_index(level=0, drop=True)
         .rename_axis('months')
         .reset_index(name='val'))
Or use numpy - flatten the values and repeat the column names with numpy.tile:
df1 = pd.DataFrame({'months': np.tile(df.columns, len(df)),
                    'val': df.values.reshape(1,-1).ravel()})
print (df1)
months val
0 Jan 100
1 Feb 400
2 March 700
3 Jan 200
4 Feb 500
5 March 800
6 Jan 300
7 Feb 600
8 March 900
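Since the question mentions melt, here is a sketch of how melt can also produce the row-wise order by keeping the original row index and doing a stable sort on it; this is my own variant, not part of the answer above.
df1 = (df.reset_index()
         .melt(id_vars='index', var_name='months', value_name='val')
         .sort_values('index', kind='stable')   # keeps Jan/Feb/March order within each original row
         .drop(columns='index')
         .reset_index(drop=True))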

How to run a script on x axis of plots in matplotlib [duplicate]

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
client Month
1 sss 02
2 yyy 12
3 www 06
I want the df to look like this:
client Month
1 sss Feb
2 yyy Dec
3 www Jun
Most of the info I found was not in python>pandas>dataframe hence the question.
You can do this efficiently by combining calendar.month_abbr and df[col].apply():
import calendar
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[x])
Since the abbreviated month names are the first three letters of the full names, we can first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use the str.slice() method to take the first three letters, all with pandas and in one line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
Performance benchmarking shows a ~130x performance differential:
import calendar
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop
Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply()
def mapper(month):
    return month.strftime('%b')
df['Month'] = df['Month'].apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
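Note that Solution 2 assumes the Month column already holds datetime-like values. A minimal end-to-end sketch using the question's string months (the sample frame here is my own reconstruction):
import pandas as pd

df = pd.DataFrame({'client': ['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})

def mapper(month):
    return month.strftime('%b')

# parse the month strings first so each value is a Timestamp with a strftime() method
df['Month'] = pd.to_datetime(df['Month'], format='%m').apply(mapper)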
using datetime object methods
I'm surprised this question doesn't already have a solution using strftime.
Note: you'll need a valid datetime object before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to a datetime object.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
Another method would be to slice the name using dt.month_name():
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec
You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May', '06': 'Jun',
           '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
One way of doing that is with the apply method in the dataframe but, to do that, you need a map to convert the months. You could either do that with a function / dictionary or with Python's own datetime.
With datetime it would be something like:
import datetime

def mapper(month):
    date = datetime.datetime(2000, month, 1)  # you need a date object with the proper month
    return date.strftime('%b')  # %b returns the month's abbreviation; see the strftime docs for other options

df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}  # and so on up to 12

def mapper(month):
    return months_map[month]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.
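For example, a minimal sketch of the lambda form of the datetime approach above, assuming Month holds integers 1-12:
import datetime

df['Month'] = df['Month'].apply(lambda m: datetime.datetime(2000, m, 1).strftime('%b'))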
Use strptime and a lambda function for this (note that this goes in the other direction, mapping an abbreviated month name back to its number):
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
Suppose we have a DF like this, and Date is already in DateTime Format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can extract the year, month number and month name easily like this:
df['year'] = df.index.year
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January
Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar

def month_mapping():
    # I'm lazy so I have a stash of functions already written so
    # I don't have to write them out every time. This returns the
    # {1:'Jan'....12:'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs

abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)
You can use Pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')
For more detail visit this link.
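If abbreviated names are wanted, as in the question, the result can be sliced; a small sketch building on the same index:
>>> idx.month_name().str[:3]
Index(['Jan', 'Feb', 'Mar'], dtype='object')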
The best way would be to do it with month_name(), as commented by Nurul Akter Towhid:
df['Month'] = df.Month.dt.month_name()
First you need to strip the leading "0" (otherwise you might get the exception "leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers").
Step 1:
def func(i):
    if i[0] == '0':
        i = i[1]
    return i

df["Month"] = df["Month"].apply(lambda x: func(x))
Step 2:
import calendar
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[int(x)])

Sorting datetime column when month is formatted as name

I have a column Date Time in my dataframe which has the date and time as a string.
Product Date Time
ABC Wed Jan 2 11:14:24 2019
ABC123 Wed Jan 2 11:14:24 2019
ABCXZY Wed Jan 2 11:14:24 2019
BVF123 Mon Jan 14 10:24:20 2019
ABC456 Mon Jan 14 10:24:20 2019
ABC000 Mon Feb 4 10:44:08 2019
ABCXYZ Mon Feb 4 10:44:08 2019
ABC678 Mon Feb 4 10:44:08 2019
ABCQYZ Wed Feb 20 09:14:40 2019
ABC090 Wed Feb 20 09:14:40 2019
I have converted this column to a datetime format using -
df['Date'] = pd.to_datetime(df['Date Time']).dt.strftime('%d-%b-%Y')
I want to now sort this dataframe on the basis of the Date column to plot the quantities for each date in ascending order of date, but when I use -
df.sort_values(by='Date', inplace=True, ascending=True)
it only gets sorted by the day number at the start of the string and ignores the month, i.e. as
02-Jan-2019
04-Feb-2019
08-Mar-2019
13-Feb-2019
14-Jan-2019
20-Feb-2019
21-Mar-2019
instead of
02-Jan-2019
14-Jan-2019
04-Feb-2019
13-Feb-2019
20-Feb-2019
08-Mar-2019
21-Mar-2019
How can I get the desired sorting using pandas datetime or any other module?
pd.to_datetime(df['Date Time']).dt.strftime('%d-%b-%Y')
returns a series of strings (dtype "object", to be precise), not a series of datetimes. That's why your sorting is wrong.
Here is code to do it:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
### Dataframe
data = {'Product' : ['ABC', 'ABC123', 'ABCXZY', 'BVF123', 'ABC456', 'ABC000', 'ABCXYZ', 'ABC678', 'ABCQYZ', 'ABC090'], 'Date Time' : ['Wed Jan 2 11:14:24 2019', 'Wed Jan 2 11:14:24 2019', 'Wed Jan 2 11:14:24 2019', 'Mon Jan 14 10:24:20 2019', 'Mon Jan 14 10:24:20 2019', 'Mon Feb 4 10:44:08 2019', 'Mon Feb 4 10:44:08 2019', 'Mon Feb 4 10:44:08 2019', 'Wed Feb 20 09:14:40 2019', 'Wed Feb 20 09:14:40 2019']}
df = pd.DataFrame(data)
### Conversion to datetime
df['Date'] = pd.to_datetime(df.loc[:, 'Date Time'])
### Sorting
df.sort_values(by = 'Date', inplace = True)
### Plot
ax = df.groupby('Date').count().Product.plot()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%b-%Y')) # Formatting x labels
We can use argsort:
df=df.iloc[pd.to_datetime(df['Date Time'],format='%d-%b-%Y').argsort()]
Out[20]:
Date Time
3 14-Jan-2019
0 04-Feb-2019
2 13-Feb-2019
4 20-Feb-2019
1 08-Mar-2019
5 21-Mar-2019
Update
s=df.groupby(['Date Time']).size()
s.index=pd.to_datetime(s.index,format='%d-%b-%Y')
s.sort_index(inplace=True)
s.index=s.index.strftime('%d-%b-%Y')
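With s sorted and its index formatted, a quick line plot of the counts per date could then be drawn; a sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

ax = s.plot(kind='line')   # counts per date, already in chronological order
ax.set_xlabel('Date')
ax.set_ylabel('Count')
plt.show()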
