plotly python Sankey Plot - python-3.x

I am trying to create a Sankey diagram in Python. The idea is to show the change in
size of each Topic month on month. This is my sample pandas DataFrame. The real data has more Topics, and each Topic spans more months and years, which makes the DataFrame tall.
df
year month Topic Document_Size
0 2022 1 0.0 63
1 2022 1 1.0 120
2 2022 1 2.0 106
3 2022 2 0.0 70
4 2022 2 1.0 42
5 2022 2 2.0 45
6 2022 3 0.0 78
7 2022 3 1.0 14
8 2022 3 2.0 84
I have prepared the following from the plotly demo. I am missing the values that should go into the variables node_label, source_node and target_node so that the following code works; at the moment I am not getting the correct plot output.
node_label = ?
source_node = ?
target_node = ?
values = df['Document_Size']
from webcolors import hex_to_rgb
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objects as go # Import the graphical object
fig = go.Figure(
data=[go.Sankey( # The plot we are interested in
# This part is for the node information
node = dict(
label = node_label
),
# This part is for the link information
link = dict(
source = source_node,
target = target_node,
value = values
))])
# With this save the plots
plot(fig,
image_filename='sankey_plot_1',
image='png',
image_width=5000,
image_height=3000)
# And shows the plot
fig.show()

Reusing this answer: sankey from dataframe.
Restructure the dataframe so that it has the structure used in that answer:
Document_Size   source        target
63              2022 01 0.0   2022 02 0.0
120             2022 01 1.0   2022 02 1.0
106             2022 01 2.0   2022 02 2.0
70              2022 02 0.0   2022 03 0.0
42              2022 02 1.0   2022 03 1.0
45              2022 02 2.0   2022 03 2.0
import pandas as pd
import io
import numpy as np
import plotly.graph_objects as go
df = pd.read_csv(
io.StringIO(
""" year month Topic Document_Size
0 2022 1 0.0 63
1 2022 1 1.0 120
2 2022 1 2.0 106
3 2022 2 0.0 70
4 2022 2 1.0 42
5 2022 2 2.0 45
6 2022 3 0.0 78
7 2022 3 1.0 14
8 2022 3 2.0 84"""
),
sep=r"\s+",
)
# data for year and month
df["date"] = pd.to_datetime(df.assign(day=1).loc[:, ["year", "month", "day"]])
# index dataframe ready for constructing dataframe of source and target
df = df.drop(columns=["year", "month"]).set_index(["date", "Topic"])
dates = df.index.get_level_values(0).unique()
# for each pair of current date and next date, construct a segment of source / target data
df_sankey = pd.concat(
[df.loc[s].assign(source=s, target=t) for s, t in zip(dates, dates[1:])]
)
df_sankey["source"] = df_sankey["source"].dt.strftime(
"%Y %m "
) + df_sankey.index.astype(str)
df_sankey["target"] = df_sankey["target"].dt.strftime(
"%Y %m "
) + df_sankey.index.astype(str)
nodes = np.unique(df_sankey[["source", "target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))
go.Figure(
go.Sankey(
node={"label": nodes.index},
link={
"source": nodes.loc[df_sankey["source"]],
"target": nodes.loc[df_sankey["target"]],
"value": df_sankey["Document_Size"],
},
)
)
output
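As a rough sketch of how the three missing variables from the question could be derived (an alternative to the concat approach above, not the original answerer's code): label each year/month/Topic row, use a grouped shift to find the same Topic in the following month, then map the labels to integer positions. The names node_label, source_node, target_node and values come from the question; links and position are made up for illustration. It assumes each Topic has exactly one row per month.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2022] * 9,
    "month": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Topic": [0.0, 1.0, 2.0] * 3,
    "Document_Size": [63, 120, 106, 70, 42, 45, 78, 14, 84],
})

# Order rows so that, within a Topic, consecutive rows are consecutive months
df = df.sort_values(["Topic", "year", "month"]).reset_index(drop=True)

# Label every (year, month, Topic) row, e.g. "2022 01 0.0"
df["source"] = (
    df["year"].astype(str) + " "
    + df["month"].astype(str).str.zfill(2) + " "
    + df["Topic"].astype(str)
)

# The target of a row is the same Topic's label in the following month
df["target"] = df.groupby("Topic")["source"].shift(-1)

# The last month of each Topic has no following month, so it starts no link
links = df.dropna(subset=["target"])

# go.Sankey expects integer positions into the label list
node_label = sorted(set(links["source"]) | set(links["target"]))
position = {label: i for i, label in enumerate(node_label)}
source_node = links["source"].map(position).tolist()
target_node = links["target"].map(position).tolist()
values = links["Document_Size"].tolist()
```

These lists can then be passed to go.Sankey(node=dict(label=node_label), link=dict(source=source_node, target=target_node, value=values)). As in the answer above, each link carries the Document_Size of its source month.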

The Sankey chart can also be created with the D3Blocks library.
Install first:
pip install d3blocks
The data frame should be restructured in the form:
# source target weight
# 2022 01 0.0 2022 02 0.0 63
# 2022 01 1.0 2022 02 1.0 120
# etc
# Load d3blocks
from d3blocks import D3Blocks
# Initialize
d3 = D3Blocks()
# Load example data
df = d3.import_example('energy')
# Plot
d3.sankey(df, filepath='sankey.html')

Related

Modelling a moving window with a shift( ) function in python problem

Problem: Let's suppose that we supply robots to a factory. Each of these robots is programmed to switch into work mode after 3 days (e.g. if it arrives on day 1, it starts working on day 3), and then it works for 5 days. After that, the battery runs out and it stops working. The number of robots supplied each day varies.
The following code is the supplies for the first 15 days like so:
import pandas as pd
df = pd.DataFrame({
    'date': ['01', '02', '03', '04', '05', '06',
             '07', '08', '09', '10', '11', '12', '13', '14', '15'],
    'value': [10, 20, 20, 30, 20, 10, 30, 20, 10, 20, 30, 40, 20, 20, 20]
})
df.set_index('date',inplace=True)
df
Let's now estimate the number of working robots on each of these days like so (we move two days back and sum up only the numbers within the past 5 days):
04 10
05 20+10 = 30
06 20+20 = 40
07 30+20 = 50
08 20+30 = 50
09 10+20 = 30
10 30+10 = 40
11 20+30 = 50
12 10+20 = 30
13 20+10 = 30
14 30+20 = 50
15 40+30 = 70
Is it possible to model this in Python? I have tried the following, which is close but not quite right:
df_p = (((df.rolling(2)).sum())).shift(5).rolling(1).mean().shift(-3)
P.S. If you don't think it's complicated enough, I also need to include the trailing 7-day average of each of these numbers for my real problem.
Let's first shift the values forward by the window (5) less the rolling window length (2), then take a rolling sum with min_periods set to 1:
shift_window = 5
rolling_window = 2
df['new_col'] = (
df['value'].shift(shift_window - rolling_window)
.rolling(rolling_window, min_periods=1).sum()
)
Or with hard coded values:
df['new_col'] = df['value'].shift(3).rolling(2, min_periods=1).sum()
df:
value new_col
date
01 10 NaN
02 20 NaN
03 20 NaN
04 30 10.0
05 20 30.0
06 10 40.0
07 30 50.0
08 20 50.0
09 10 30.0
10 20 40.0
11 30 50.0
12 40 30.0
13 20 30.0
14 20 50.0
15 20 70.0
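The question's P.S. also asks for a 7-day average, which the answer above does not cover. A minimal sketch, assuming the average is meant as a trailing mean over the computed working-robot counts; the helper name working_robots and the column names working and working_7d_avg are made up for illustration:

```python
import pandas as pd

def working_robots(supplied: pd.Series, delay: int, lifetime: int) -> pd.Series:
    """Shift arrivals forward by `delay` days, then sum over `lifetime` days."""
    return supplied.shift(delay).rolling(lifetime, min_periods=1).sum()

df = pd.DataFrame(
    {"value": [10, 20, 20, 30, 20, 10, 30, 20, 10, 20, 30, 40, 20, 20, 20]},
    index=pd.Index([f"{d:02d}" for d in range(1, 16)], name="date"),
)

# Same numbers as shift(3).rolling(2, min_periods=1).sum() above
df["working"] = working_robots(df["value"], delay=3, lifetime=2)

# Trailing 7-day mean; stays NaN until 7 non-NaN working counts are available
df["working_7d_avg"] = df["working"].rolling(7).mean()
```

If a partial average is acceptable for the first few days, passing min_periods=1 to the second rolling call fills those rows as well.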

Box Whisker plot of date frequency

Good morning all!
I have a pandas df and I'm trying to create a monthly box-and-whisker plot of 30 years of data.
DataFrame
datetime year month day hour lon lat
0 3/18/1986 10:17 1986 3 18 10 -124.835 46.540
1 6/7/1986 13:38 1986 6 7 13 -121.669 46.376
2 7/17/1986 20:56 1986 7 17 20 -122.436 48.044
3 7/26/1986 2:46 1986 7 26 2 -123.071 48.731
4 8/2/1986 19:54 1986 8 2 19 -123.654 48.480
I'm trying to see the mean number of occurrences in month X, the median, and the max/min occurrences (and the dates of the max and min).
I've been playing around with pandas.DataFrame.groupby() but don't fully understand it.
I have grouped the date by month and day occurrences. I like this format:
Code:
df = pd.read_csv(masterCSVPath)
months = df['month']
test = df.groupby(['month','day'])['day'].count()
output: ---->
month day
1 1 50
2 103
3 97
4 29
5 60
...
12 27 24
28 7
29 17
30 18
31 9
So how can I turn the output above into a box-and-whisker plot?
I want the x-axis to be months and the y-axis to be occurrences.
Try this (without doing groupby):
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x = 'month', y = 'day', data = df)
In case you want the months to be in Jan, Feb format then try this:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# the 'datetime' column is read as strings, so parse it before using .dt
df['month_new'] = pd.to_datetime(df['datetime']).dt.strftime('%b')
sns.boxplot(x = 'month_new', y = 'day', data = df)
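Note that the boxplots above show the distribution of the day-of-month values. If the goal is the distribution of occurrence counts per month across the years, one possible approach is to count rows per (year, month) first and plot those counts. A minimal sketch using the five sample rows from the question; the names counts and occurrences are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": ["3/18/1986 10:17", "6/7/1986 13:38", "7/17/1986 20:56",
                 "7/26/1986 2:46", "8/2/1986 19:54"],
})
df["datetime"] = pd.to_datetime(df["datetime"], format="%m/%d/%Y %H:%M")

# One row per (year, month) holding the number of events in that month
counts = (
    df.groupby([df["datetime"].dt.year.rename("year"),
                df["datetime"].dt.month.rename("month")])
      .size()
      .rename("occurrences")
      .reset_index()
)
```

sns.boxplot(x='month', y='occurrences', data=counts) then draws one box per month, and counts.groupby('month')['occurrences'].agg(['mean', 'median', 'min', 'max']) gives the summary numbers. Year/month combinations with no events at all do not appear in counts, so they are not counted as zeros here.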

How to split rows into columns in a dataframe

I have this dataframe (840 rows x 1 column):
0 151284 Apr 19 11:37 0-01-20200419063614
1 48054 Apr 21 12:50 0-01-20200421074934
2 187588 Apr 21 13:55 0-01-20200421085439
3 51584 Apr 21 14:37 0-01-20200421143636
4 63522 Apr 22 08:40 0-01-20200422083937
I want to convert this dataframe into a format like this:
id datetime size
151284 2020-04-19 11:37:00 0-01-20200419063614
. . .
datetime being in the format: (yyyy-mm-dd)(hr-min-sec). So basically splitting a single column into three columns and also combining date and time into a single datetime column in a standard format.
Any help is appreciated.
EDIT: output of df.columns: Index(['col'], dtype='object')
Like this:
In [70]: df = pd.DataFrame({'col':['151284 Apr 19 11:37 0-01-20200419063614', '48054 Apr 21 12:50 0-01-20200421074934', '187588 Apr 21 13:55 0-01-20200421085439', '51584 Apr 21 14:37 0-01-20200421143636',
...: '63522 Apr 22 08:40 0-01-20200422083937']})
In [54]: df['id'] = df.col.str.split(' ').str[0]
In [55]: df['Datetime'] = df.col.str.split(' ').str[1] + ' ' + df.col.str.split(' ').str[2] + ' ' + df.col.str.split(' ').str[3]
In [57]: df['Size'] = df.col.str.split(' ').str[-1]
In [63]: from dateutil import parser
In [65]: def format_datetime(x):
    ...:     return parser.parse(x)
    ...:
In [67]: df['Datetime'] = df.Datetime.apply(format_datetime)
In [79]: df
Out[79]:
id Datetime Size
0 151284 2020-04-19 11:37:00 0-01-20200419063614
1 48054 2020-04-21 12:50:00 0-01-20200421074934
2 187588 2020-04-21 13:55:00 0-01-20200421085439
3 51584 2020-04-21 14:37:00 0-01-20200421143636
4 63522 2020-04-22 08:40:00 0-01-20200422083937
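One caveat with the approach above: the strings contain no year, and dateutil fills missing fields from the current date, so the 2020 in the output reflects when the code was run. A minimal sketch of a variant that splits the column with a single regular expression and supplies the year explicitly; YEAR is an assumption made for illustration, and the column names follow the question:

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "151284 Apr 19 11:37 0-01-20200419063614",
    "48054 Apr 21 12:50 0-01-20200421074934",
]})

# Named groups become column names: id, stamp (month day time) and size
parts = df["col"].str.extract(
    r"^(?P<id>\d+)\s+(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2})\s+(?P<size>\S+)$"
)

YEAR = 2020  # assumption: the rows carry no year, so it has to be supplied

# Prepend the year and parse with an explicit format
parts["datetime"] = pd.to_datetime(
    str(YEAR) + " " + parts.pop("stamp"), format="%Y %b %d %H:%M"
)
result = parts[["id", "datetime", "size"]]
```

The id column stays a string here; result["id"].astype(int) converts it if numbers are needed.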

Aggregate time series with group by and create chart with multiple series

I have time series data and I want to create a chart of the monthly (x-axis) counts of the number of records (lines chart), grouped by sentiment (multiple lines)
Data looks like this
created_at id polarity sentiment
0 Fri Nov 02 11:22:47 +0000 2018 1058318498663870464 0.000000 neutral
1 Fri Nov 02 11:20:54 +0000 2018 1058318026758598656 0.011905 neutral
2 Fri Nov 02 09:41:37 +0000 2018 1058293038739607552 0.800000 positive
3 Fri Nov 02 09:40:48 +0000 2018 1058292834699231233 0.800000 positive
4 Thu Nov 01 18:23:17 +0000 2018 1058061933243518976 0.233333 neutral
5 Thu Nov 01 17:50:39 +0000 2018 1058053723157618690 0.400000 positive
6 Wed Oct 31 18:57:53 +0000 2018 1057708251758903296 0.566667 positive
7 Sun Oct 28 17:21:24 +0000 2018 1056596810570100736 0.000000 neutral
8 Sun Oct 21 13:00:53 +0000 2018 1053994531845296128 0.136364 neutral
9 Sun Oct 21 12:55:12 +0000 2018 1053993101205868544 0.083333 neutral
So far I have managed to aggregate to the monthly totals, with the following code:
import pandas as pd
tweets = process_twitter_json(file_name)
#print(tweets[:10])
df = pd.DataFrame.from_records(tweets)
print(df.head(10))
#make the string date into a date field
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
#print('Monthly counts')
monthly_sentiment = df.groupby('sentiment')['tweet_datetime'].resample('M').count()
I'm struggling with how to chart the data.
Do I pivot to turn each of the discrete values within the sentiment
field into separate columns?
I've tried .unstack(), which turns the sentiment values into rows.
That is almost there, but the problem is that the dates become string column
headers, which is no good for charting.
OK, I changed the monthly aggregation method and used Grouper instead of resample. As a result, unstack() now produces a vertical dataframe (deep and narrow) with dates as rows, rather than a horizontal one with dates as column headers, so the dates are no longer stored as strings when I come to chart them.
Full code:
import pandas as pd
tweets = process_twitter_json(file_name)
df = pd.DataFrame.from_records(tweets)
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
grouper = df.groupby(['sentiment', pd.Grouper(key='tweet_datetime', freq='M')]).id.count()
result = grouper.unstack('sentiment').fillna(0)
##=================================================
##PLOTLY - charts in Jupyter
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
print (__version__)# requires version >= 1.9.0
import plotly.graph_objs as go
init_notebook_mode(connected=True)
trace0 = go.Scatter(
x = result.index,
y = result['positive'],
name = 'Positive',
line = dict(
color = ('rgb(205, 12, 24)'),
width = 4)
)
trace1 = go.Scatter(
x = result.index,
y = result['negative'],
name = 'Negative',
line = dict(
color = ('rgb(22, 96, 167)'),
width = 4)
)
trace2 = go.Scatter(
x = result.index,
y = result['neutral'],
name = 'Neutral',
line = dict(
color = ('rgb(12, 205, 24)'),
width = 4)
)
data = [trace0, trace1, trace2]
iplot(data)
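To make the shape of result concrete, here is a small self-contained sketch of the Grouper plus unstack step using four rows from the sample data. It uses freq="MS" (month start) instead of the 'M' used above, only because newer pandas versions warn about 'M'; that substitution is mine and shifts the labels to the first day of each month.

```python
import pandas as pd

df = pd.DataFrame({
    "created_at": [
        "Fri Nov 02 11:22:47 +0000 2018",
        "Fri Nov 02 09:41:37 +0000 2018",
        "Wed Oct 31 18:57:53 +0000 2018",
        "Sun Oct 28 17:21:24 +0000 2018",
    ],
    "id": [1058318498663870464, 1058293038739607552,
           1057708251758903296, 1056596810570100736],
    "sentiment": ["neutral", "positive", "positive", "neutral"],
})
df["tweet_datetime"] = pd.to_datetime(
    df["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)

# Count ids per (month, sentiment), then move sentiment into the columns
result = (
    df.groupby([pd.Grouper(key="tweet_datetime", freq="MS"), "sentiment"])["id"]
      .count()
      .unstack("sentiment", fill_value=0)
)
```

result now has one row per month on a DatetimeIndex and one column per sentiment, so a single loop over result.columns can build one go.Scatter trace per sentiment instead of three hand-written ones.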

Plotting line graph on the same figure using matplotlib [duplicate]

I have a temperature file with many years temperature records, in a format as below:
2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
Every year has a different number of records, taken at different times, so the pandas DatetimeIndex values are all different.
I want to plot the different years' data in the same figure for comparison. The x-axis is Jan to Dec and the y-axis is temperature. How should I go about doing this?
Try:
ax = df1.plot()
df2.plot(ax=ax)
If you are running a Jupyter/IPython notebook and having problems using
ax = df1.plot()
df2.plot(ax=ax)
run both commands inside the same cell. For me at least, it does not work when they are split across sequential cells.
Chang's answer shows how to plot a different DataFrame on the same axes.
In this case, all of the data is in the same dataframe, so it's better to use groupby and unstack.
Alternatively, pandas.DataFrame.pivot_table can be used.
dfp = df.pivot_table(index='Month', columns='Year', values='value', aggfunc='mean')
When using pandas.read_csv, names= creates column headers when there are none in the file. The 'date' column must be parsed into datetime64[ns] Dtype so the .dt extractor can be used to extract the month and year.
import pandas as pd
# given the data in a file as shown in the op
df = pd.read_csv('temp.csv', names=['date', 'time', 'value'], parse_dates=['date'])
# create additional month and year columns for convenience
df['Year'] = df.date.dt.year
df['Month'] = df.date.dt.month
# group by month and year and aggregate the mean of the value column
dfg = df.groupby(['Month', 'Year'])['value'].mean().unstack()
# display(dfg)
Year 2005 2006 2007 2012
Month
4 NaN 20.6 NaN 20.7
5 NaN NaN 15.533333 NaN
8 19.566667 NaN NaN NaN
Now it's easy to plot each year as a separate line. The OP only has one observation for each year, so only a marker is displayed.
ax = dfg.plot(figsize=(9, 7), marker='.', xticks=dfg.index)
To do this for multiple dataframes, you can do a for loop over them:
import matplotlib.pyplot as plt
fig = plt.figure(num=None, figsize=(10, 8))
ax = dict_of_dfs['FOO'].column.plot()
for BAR in dict_of_dfs.keys():
    if BAR == 'FOO':
        pass
    else:
        dict_of_dfs[BAR].column.plot(ax=ax)
This can also be implemented without the if condition:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for BAR in dict_of_dfs.keys():
    dict_of_dfs[BAR].plot(ax=ax)
You can make use of the hue parameter in seaborn. For example:
import seaborn as sns
df = sns.load_dataset('flights')
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
.. ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432
sns.lineplot(x='month', y='passengers', hue='year', data=df)
