plot data from two DataFrames with different dates - python-3.x

I'm trying to plot data from two DataFrames in the same figure. The problem is that I'm using calendar dates for my x-axis, and pandas apparently does not like this. The code below shows a minimal example of what I'm trying to do. There are two datasets, each with a numeric value associated with calendar dates. The data in the second DataFrame comes after the data in the first. I want to plot them both in the same figure with appropriate dates and different line colors. The problem is that the pandas.DataFrame.plot method aligns the starting dates of both DataFrames in the chart, which renders the visualization useless.
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'date': ['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13', '2020-03-14', '2020-03-15'],
'number': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': ['2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19'],
'number': [7, 6, 5, 4]})
ax = df1.plot(x='date', y='number', label='beginning')
df2.plot(x='date', y='number', label='ending', ax=ax)
plt.show()
The figure created looks like this:
Is there any way I can fix this? Could I also get dates to be shown in the x-axis tilted so they're also more legible?

You need to cast 'date' to datetime dtype using pd.to_datetime:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'date': ['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13', '2020-03-14', '2020-03-15'],
'number': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': ['2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19'],
'number': [7, 6, 5, 4]})
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
ax = df1.plot(x='date', y='number', label='beginning')
df2.plot(x='date', y='number', label='ending', ax=ax)
plt.show()
Output:
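The question also asked about tilting the date labels, which the code above does not cover. A minimal sketch, assuming the same df1 and df2 as above and using matplotlib's Figure.autofmt_xdate to rotate the labels (the 45-degree angle is an arbitrary choice, not something from the original answer):

```python
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({'date': ['2020-03-10', '2020-03-11', '2020-03-12',
                             '2020-03-13', '2020-03-14', '2020-03-15'],
                    'number': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': ['2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19'],
                    'number': [7, 6, 5, 4]})

# Convert the string dates so both frames share one real time axis.
for frame in (df1, df2):
    frame['date'] = pd.to_datetime(frame['date'])

ax = df1.plot(x='date', y='number', label='beginning')
df2.plot(x='date', y='number', label='ending', ax=ax)

# Rotate and right-align the x tick labels (45 degrees is an assumption).
ax.figure.autofmt_xdate(rotation=45)
```

Call plt.show() afterwards to display the figure when running this as a script.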

Related

Plotting aggregated values from specific column of multiple dataframe indexed by timedate

I have three DataFrames as below:
import pandas as pd
labels=['1','2','3','Aggregated']
df1 = pd.DataFrame({'date_time': ["2022-10-06 17:23:11","2022-10-06 17:23:12","2022-10-06 17:23:13","2022-10-06 17:23:14","2022-10-06 17:23:15","2022-10-06 17:23:16"],
'value': [4, 5, 6, 7, 8, 9]})
df2 = pd.DataFrame({'date_time': ["2022-10-06 17:23:13","2022-10-06 17:23:14","2022-10-06 17:23:15","2022-10-06 17:23:16","2022-10-06 17:23:17","2022-10-06 17:23:18"],
'value': [4, 5, 6, 7, 8, 9]})
df3 = pd.DataFrame({'date_time': ["2022-10-06 17:23:16","2022-10-06 17:23:17","2022-10-06 17:23:18","2022-10-06 17:23:19","2022-10-06 17:23:20","2022-10-06 17:23:21"],
'value': [4, 5, 6, 7, 8, 9]})
I need to create another DataFrame df that contains all the datetime elements from df1, df2, and df3, such that values sharing a common timestamp (ignoring the millisecond part) are summed, as shown below.
df=
{'date_time': ["2022-10-06 17:23:11","2022-10-06 17:23:12","2022-10-06 17:23:13","2022-10-06 17:23:14","2022-10-06 17:23:15","2022-10-06 17:23:16","2022-10-06 17:23:17","2022-10-06 17:23:18","2022-10-06 17:23:19","2022-10-06 17:23:20","2022-10-06 17:23:21"],
'value': [4, 5, 6+4, 7+5, 8+6, 9+7+4, 8+5, 9+6, 7, 8, 9]}
For adding the columns I used the following:
df = (pd.concat([df1,df2,df3],axis=0)).groupby('date_time')['value'].sum().reset_index()
For plotting I used the following, which causes df2 and df3 to be time-shifted towards df1:
for dataFrame,number in zip([df1,df2,df3,df],labels):
    dataFrame["value"].plot(label=number)
How can I plot df1, df2, and df3 without the time shift, and also plot the aggregated df on the same plot for the column 'value'?
IIUC, you are looking for something like this:
import matplotlib.pyplot as plt
labels=["Aggregated","1","2","3"]
color_dict = {"Aggregated": "orange", "1": "darkred", "2": "green", "3": "blue"}
fig, ax = plt.subplots()
# plot the aggregate first with the thickest line, then each input on top of it
for i, (d, label) in enumerate(zip([df, df1, df2, df3], labels),1):
    ax.plot(d["date_time"], d["value"], lw=3/i, color=color_dict[label], label=label)
plt.setp(ax.get_xticklabels(), ha="right", rotation=45)
plt.legend()
plt.show()
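The time shift in the question appears to come from plotting "value" against the default integer index instead of the timestamps. A minimal end-to-end sketch, assuming the inputs are DataFrames with real datetime values and that flooring to whole seconds with dt.floor('s') is what "excluding the millisecond parts" means; make_frame is a hypothetical helper introduced only to keep the example short:

```python
import pandas as pd
import matplotlib.pyplot as plt

def make_frame(start, values=(4, 5, 6, 7, 8, 9)):
    """Hypothetical helper: one row per second, beginning at `start`."""
    times = pd.date_range(start, periods=len(values), freq='s')
    return pd.DataFrame({'date_time': times, 'value': list(values)})

df1 = make_frame('2022-10-06 17:23:11')
df2 = make_frame('2022-10-06 17:23:13')
df3 = make_frame('2022-10-06 17:23:16')
frames = [df1, df2, df3]

# Drop any sub-second part so equal seconds fall into the same group.
for frame in frames:
    frame['date_time'] = frame['date_time'].dt.floor('s')

# Sum the values that share a timestamp across all three frames.
df = (pd.concat(frames)
        .groupby('date_time', as_index=False)['value']
        .sum())

# Plot every frame against its own timestamps, so nothing is shifted.
fig, ax = plt.subplots()
for frame, label in zip([df1, df2, df3, df], ['1', '2', '3', 'Aggregated']):
    ax.plot(frame['date_time'], frame['value'], label=label)
fig.autofmt_xdate(rotation=45)
ax.legend()
```

Because the x values are datetimes rather than strings, matplotlib places each series at its true position on a shared time axis.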

How to convert each row in data frame to a node with attributes?

Given a sample data frame df:
import pandas as pd
df = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'a': [55, 2123, -19.3, 9, -8],
'b': ['aa', 'bb', 'ad', 'kuku', 'lulu']
})
Now I want to "upload" this data to a graph. Every row should be a node with id, a, and b attributes.
I have tried to do this with from_pandas_dataframe NetworkX method.
Please advise which function is responsible to do this in NetworkX?
The from_pandas_edgelist function requires both a source and a target column:
import networkx as nx
# 'col_target' stands for a target column, which this df does not have
G = nx.from_pandas_edgelist(df, source='id', target='col_target')
You could use a loop to generate the graph:
import networkx as nx
G = nx.Graph()
# add one node per row, using the remaining columns as node attributes
for i, attr in df.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())
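Or, with from_pandas_edgelist, if you really want only the nodes, you can add a dummy target. Note that edge_attr attaches a and b to the edges rather than to the nodes:
G = nx.from_pandas_edgelist(df.assign(target='none'),
                            source='id', target='target',
                            edge_attr=['a', 'b'])
Then remove the dummy node. Its edges are removed with it, so only the bare nodes remain and the a and b values are no longer stored in the graph:
G.remove_node('none')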
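The loop above can also be written without iterating row by row. A minimal sketch, assuming the same df, that builds one attribute dictionary per id with DataFrame.to_dict(orient='index') and passes the (node, attributes) pairs to Graph.add_nodes_from; the name node_attrs is just an illustrative choice:

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': ['aa', 'bb', 'ad', 'kuku', 'lulu']
})

# Maps each id to its row, e.g. 4 -> {'a': 9.0, 'b': 'kuku'}
node_attrs = df.set_index('id').to_dict(orient='index')

G = nx.Graph()
# add_nodes_from accepts (node, attribute_dict) pairs
G.add_nodes_from(node_attrs.items())
```

The attributes can then be read back with G.nodes[4]['b'] or listed with G.nodes(data=True).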

Matplotlib: plot the entire column values in pandas

I have the following data frame my_df:
   my_1  my_2  my_3
-------------------
0     5     7     4
1     3     5    13
2     1     2     8
3    12     9     9
4     6     1     2
I want to make a plot where x-axis is categorical values with my_1, my_2, and my_3. y-axis is integer. For each column in my_df, I want to plot all its 5 values at x = my_i. What kind of plot should I use in matplotlib? Thanks!
You could make a bar chart:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
df.T.plot(kind='bar')
plt.show()
or a scatter plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
fig, ax = plt.subplots()
cols = np.arange(len(df.columns))
x = np.repeat(cols, len(df))
y = df.values.ravel(order='F')
color = np.tile(np.arange(len(df)), len(df.columns))
scatter = ax.scatter(x, y, s=150, c=color)
ax.set_xticks(cols)
ax.set_xticklabels(df.columns)
cbar = plt.colorbar(scatter)
cbar.set_ticks(np.arange(len(df)))
plt.show()
Just for fun, here is how to make the same scatter plot using Pandas' df.plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
columns = df.columns
index = df.index
df = df.stack()
df.index.names = ['color', 'column']
df = df.rename('y').reset_index()
df['x'] = pd.Categorical(df['column']).codes
ax = df.plot(kind='scatter', x='x', y='y', c='color', colorbar=True,
cmap='viridis', s=150)
ax.set_xticks(np.arange(len(columns)))
ax.set_xticklabels(columns)
cbar = ax.collections[-1].colorbar
cbar.set_ticks(index)
plt.show()
Unfortunately, it requires quite a bit of DataFrame manipulation just to call
df.plot and then there are some extra matplotlib calls needed to set the tick
marks on the scatter plot and colorbar. Since Pandas is not saving effort here,
I would go with the first (NumPy/matplotlib) approach shown above.

How can I create a seaborn regression plot with multiindex dataframe?

I have time series data which are multi-indexed on (Year, Month) as seen here:
print(df.index)
print(df)
MultiIndex(levels=[[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0], [2, 3, 4, 5, 6, 7, 8, 9]],
names=['Year', 'Month'])
                Value
Year Month
2016 3      65.018150
     4      63.130035
     5      71.071254
     6      72.127967
     7      67.357795
     8      66.639228
     9      64.815232
     10     68.387698
I want to do very basic linear regression on these time series data. Because pandas.DataFrame.plot does not do any regression, I intend to use Seaborn to do my plotting.
I attempted to do this by using lmplot:
sns.lmplot(x=("Year", "Month"), y="Value", data=df, fit_reg=True)
but I get an error:
TypeError: '>' not supported between instances of 'str' and 'tuple'
This is particularly interesting to me because all elements in df.index.levels[:] are of type numpy.int64, all elements in df.index.labels[:] are of type numpy.int8.
Why am I receiving this error? How can I resolve it?
You can use reset_index to turn the DataFrame's index into columns. Plotting DataFrame columns is then straightforward with seaborn.
I guess the reason to use lmplot is to show different regressions for different years (otherwise a regplot may be better suited), so the "Year" column can be used as hue.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iterables = [[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
index = pd.MultiIndex.from_product(iterables, names=['Year', 'Month'])
df = pd.DataFrame({"values":np.random.rand(24)}, index=index)
df2 = df.reset_index() # or, df.reset_index(inplace=True) if df is not required otherwise
g = sns.lmplot(x="Month", y="values", data=df2, hue="Year")
plt.show()
Consider the following approach:
df['x'] = df.index.get_level_values(0) + df.index.get_level_values(1)/100
yields:
In [49]: df
Out[49]:
                Value        x
Year Month
2016 3      65.018150  2016.03
     4      63.130035  2016.04
     5      71.071254  2016.05
     6      72.127967  2016.06
     7      67.357795  2016.07
     8      66.639228  2016.08
     9      64.815232  2016.09
     10     68.387698  2016.10
Let's prepare the x-tick labels:
labels = df.index.get_level_values(0).astype(str) + '-' + \
    df.index.get_level_values(1).astype(str).str.zfill(2)
sns.lmplot(x='x', y='Value', data=df, fit_reg=True)
ax = plt.gca()
# place one tick per data point so each label lines up with its value
ax.set_xticks(df['x'])
ax.set_xticklabels(labels)
Result:
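Both answers depend on the regression variable being an ordinary column rather than an index level. As a hedged sketch of the same idea without seaborn, assuming a running month counter is an acceptable time axis, np.polyfit can fit the straight line directly. The names flat and t and the month-counter encoding are assumptions for illustration, not part of the original posts:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the sample data from the question (2016, months 3 to 10).
index = pd.MultiIndex.from_product([[2016], range(3, 11)], names=['Year', 'Month'])
df = pd.DataFrame({'Value': [65.018150, 63.130035, 71.071254, 72.127967,
                             67.357795, 66.639228, 64.815232, 68.387698]},
                  index=index)

# Turn the index levels into columns, then count months from the first row.
flat = df.reset_index()
flat['t'] = flat['Year'] * 12 + flat['Month']
flat['t'] -= flat['t'].min()

# Least-squares line: Value is approximated by slope * t + intercept.
slope, intercept = np.polyfit(flat['t'], flat['Value'], deg=1)

fig, ax = plt.subplots()
ax.scatter(flat['t'], flat['Value'])
ax.plot(flat['t'], slope * flat['t'] + intercept)
tick_labels = (flat['Year'].astype(str) + '-'
               + flat['Month'].astype(str).str.zfill(2)).tolist()
ax.set_xticks(flat['t'])
ax.set_xticklabels(tick_labels, rotation=45)
```

Unlike Year + Month/100, a month counter keeps consecutive months equally spaced across a year boundary, which matters once the data spans more than one year.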

How to add column next to Seaborn heat map

Given the code below, which produces a heat map, how can I get column "D" (the total column)
to display as a column to the right of the heat map with no color, just aligned total values per cell? I'm also trying to move the labels to the top. I don't mind that the labels on the left are horizontal as this does not occur with my actual data.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
df = pd.DataFrame(
{'A' : ['A', 'A', 'B', 'B','C', 'C', 'D', 'D'],
'B' : ['A', 'B', 'A', 'B','A', 'B', 'A', 'B'],
'C' : [2, 4, 5, 2, 0, 3, 9, 1],
'D' : [6, 6, 7, 7, 3, 3, 10, 10]})
df=df.pivot(index='A', columns='B', values='C')
fig, ax = plt.subplots(1, 1, figsize =(4,6))
sns.heatmap(df, annot=True, linewidths=0, cbar=False)
plt.show()
Here's the desired result:
Thanks in advance!
I think the cleanest way (although probably not the shortest), would be to plot Total as one of the columns, and then access colors of the facets of the heatmap and change some of them to white.
The element that is responsible for color on heatmap is matplotlib.collections.QuadMesh. It contains all facecolors used for each facet of the heatmap, from left to right, bottom to top.
You can modify some colors and pass them back to QuadMesh before you plt.show().
There is a slight problem that seaborn changes text color of some of the annotations to make them visible on dark background, and they become invisible when you change to white color. So for now I set color of all text to black, you will need to figure out what is best for your plots.
Finally, to put x axis ticks and label on top, use:
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
The final version of the code:
import matplotlib.pyplot as plt
from matplotlib.collections import QuadMesh
from matplotlib.text import Text
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame(
{'A' : ['A', 'A', 'B', 'B','C', 'C', 'D', 'D'],
'B' : ['A', 'B', 'A', 'B','A', 'B', 'A', 'B'],
'C' : [2, 4, 5, 2, 0, 3, 9, 1],
'D' : [6, 6, 7, 7, 3, 3, 10, 10]})
df=df.pivot(index='A', columns='B', values='C')
# create "Total" column
df['Total'] = df['A'] + df['B']
fig, ax = plt.subplots(1, 1, figsize =(4,6))
sns.heatmap(df, annot=True, linewidths=0, cbar=False)
# find your QuadMesh object and get array of colors
quadmesh = ax.findobj(QuadMesh)[0]
facecolors = quadmesh.get_facecolors()
# make colors of the last column white
facecolors[np.arange(2,12,3)] = np.array([1,1,1,1])
# set modified colors
quadmesh.set_facecolors(facecolors)
# set color of all text to black
for i in ax.findobj(Text):
    i.set_color('black')
# move x ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
plt.show()
P.S. I am on Python 2.7, some syntax adjustments might be required, though I cannot think of any.
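An alternative that avoids editing the QuadMesh colors is to draw the totals in a second, narrow axes next to the heat map. This is only a sketch of a different technique from the one in the answer above: it assumes a single-color ListedColormap is an acceptable way to get uncolored cells, and the names data, total, ax_heat and ax_total are illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    {'A': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
     'B': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
     'C': [2, 4, 5, 2, 0, 3, 9, 1]})
data = df.pivot(index='A', columns='B', values='C')

# Row totals as a one-column DataFrame, kept separate from the heat map data.
total = data.sum(axis=1).to_frame('Total')

# Two axes side by side: a wide one for the heat map, a narrow one for totals.
fig, (ax_heat, ax_total) = plt.subplots(
    1, 2, figsize=(4, 6),
    gridspec_kw={'width_ratios': [2, 1], 'wspace': 0.05})

sns.heatmap(data, annot=True, linewidths=0, cbar=False, ax=ax_heat)
# A one-color colormap leaves every total cell white; annotations still show.
sns.heatmap(total, annot=True, cbar=False, yticklabels=False,
            cmap=ListedColormap(['white']), ax=ax_total)
ax_total.set_ylabel('')

# Move x ticks and labels to the top on both axes.
for ax in (ax_heat, ax_total):
    ax.xaxis.tick_top()
    ax.xaxis.set_label_position('top')
```

Keeping the totals out of the colored heat map also means they do not stretch its color scale.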
