I have time series data which are multi-indexed on (Year, Month) as seen here:
print(df.index)
print(df)
MultiIndex(levels=[[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0], [2, 3, 4, 5, 6, 7, 8, 9]],
names=['Year', 'Month'])
Value
Year Month
2016 3 65.018150
4 63.130035
5 71.071254
6 72.127967
7 67.357795
8 66.639228
9 64.815232
10 68.387698
I want to do very basic linear regression on these time series data. Because pandas.DataFrame.plot does not do any regression, I intend to use Seaborn to do my plotting.
I attempted to do this by using lmplot:
sns.lmplot(x=("Year", "Month"), y="Value", data=df, fit_reg=True)
but I get an error:
TypeError: '>' not supported between instances of 'str' and 'tuple'
This is particularly interesting to me because all elements in df.index.levels[:] are of type numpy.int64, all elements in df.index.labels[:] are of type numpy.int8.
Why am I receiving this error? How can I resolve it?
You can use reset_index to turn the dataframe's index into columns. Plotting DataFrames columns is then straight forward with seaborn.
As I guess the reason to use lmplot would be to show different regressions for different years (otherwise a regplot may be better suited), the "Year"column can be used as hue.
import numpy as np
import pandas as pd
import seaborn.apionly as sns
import matplotlib.pyplot as plt
iterables = [[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
index = pd.MultiIndex.from_product(iterables, names=['Year', 'Month'])
df = pd.DataFrame({"values":np.random.rand(24)}, index=index)
df2 = df.reset_index() # or, df.reset_index(inplace=True) if df is not required otherwise
g = sns.lmplot(x="Month", y="values", data=df2, hue="Year")
plt.show()
Consider the following approach:
df['x'] = df.index.get_level_values(0) + df.index.get_level_values(1)/100
yields:
In [49]: df
Out[49]:
Value x
Year Month
2016 3 65.018150 2016.03
4 63.130035 2016.04
5 71.071254 2016.05
6 72.127967 2016.06
7 67.357795 2016.07
8 66.639228 2016.08
9 64.815232 2016.09
10 68.387698 2016.10
let's prepare X-ticks labels:
labels = df.index.get_level_values(0).astype(str) + '-' + \
df.index.get_level_values(1).astype(str).str.zfill(2)
sns.lmplot(x='x', y='Value', data=df, fit_reg=True)
ax = plt.gca()
ax.set_xticklabels(labels)
Result:
Related
I have written a function that reshapes the numpy array per given window length.
The function does the following.
My function as follows:
def w_s(df, l):
"""
Convert numpy array into desired shape with lag 1.
Args:
df (numpy.ndarray): Numpy array.
l (integer): Length of the sample window.
Returns:
Returns numpy array in a desired shape to be used in decision trees.
"""
data = np.zeros((l, 1))
data = np.append(data, df)
data = data[l:]
for i in range(1, l):
s1 = np.roll(df,0-i)
data = np.append(data,s1)
data = data.reshape(l, len(df)).T
return data[:-(l-1)]
I have two arrays with the length of 1780000. The function takes around 3 hours.
CPU times: user 5min 20s, sys: 43min 45s, total: 49min 6s Wall time: 3h 5min 46s
My machine is Mac M1. I am running this on Jupyter cell where server runs on Firefox. How can I do this faster?
That's sliding_window_view:
>>> import numpy as np
>>> arr = np.arange(10)
>>> np.lib.stride_tricks.sliding_window_view(arr, 5)
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8],
[5, 6, 7, 8, 9]])
Say I have an example dataframe below:
Division Home Corners Away Corners
Bundesliga 5 3
Bundesliga 5 5
EPL 7 4
EPL 3 2
League 1 10 6
Serie A 3 3
Serie A 8 2
League 1 3 1
I want to create a boxplot of total corners per game grouped by divison, but I want the home corners and away Corners to be separated but on the same figure. Similar to what the "hue" keyword accomplishes, but how do I accomplish that?
seaborn.boxplot
Reshape the data to a long form with pandas.DataFrame.stack
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'Division': ['Bundesliga', 'Bundesliga', 'EPL', 'EPL', 'League 1', 'Serie A', 'Serie A', 'League 1'],
'Home Corners': [5, 5, 7, 3, 10, 3, 8, 3],
'Away Corners ': [3, 5, 4, 2, 6, 3, 2, 1]}
df = pd.DataFrame(data)
# convert the data to a long format
df.set_index('Division', inplace=True)
dfl = df.stack().reset_index().rename(columns={'level_1': 'corners', 0: 'val'})
# plot
sns.boxplot('corners', 'val', data=dfl, hue='Division')
plt.legend(title='Division', bbox_to_anchor=(1.05, 1), loc='upper left')
You can melt the original data and use sns.boxplot:
sns.boxplot(data=df.melt('Division', var_name='Home/Away', value_name='Corners'),
x='Division', y='Corners',hue='Home/Away')
Output:
I have a dataframe like this:
mid value label
ID
192 3 176.6 [9, 6, 8, 0, 8, 8, 7, 9, 2, 19...
192 4 73.6 [9, 6, 8, 0, 8, 8, 7, 9, 2, 19...
192 5 15.8 [9, 6, 8, 0, 8, 8, 7, 9, 2, 19...
194 3 9603.2 [0, 0, 0, 0, 0, 9, 6, 1, 8, ...
I want to implement MultiLabelBinarizer after removing the duplicate values in each list of label column.
I have tried by looping the frame and removing duplicates. and also, the multilabel binarizer doesnt work and throws an exception
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(y_train.data)
X_train includes the mid and value columns
y_train includes label values
id is the index
I expect a prediction from the above values after the duplicate values are removed from each list of label column
Let's assume your dataframe is named df:
df2 = pd.DataFrame(df.groupby(['ID','mid', 'value'])['label'].apply(lambda x: tuple(x.values)))
df2.reset_index(inplace=True)
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(df2['label'])
mlb.transform(df2['label'])
I am trying to create a series of graphs that share x and y labels. I can get the graphs to each have a label (explained well here!), but this is not what I am looking for.
I want one label that covers the y axis of both graphs, and same for the x axis.
I've been looking at the matplotlib and pandas documentation and I was unable to find anything that addresses this issues when the using by argument.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 1, 2, 3, 4, 3, 4],
'B': [1, 7, 2, 4, 1, 4, 8, 3],
'C': [1, 4, 8, 3, 1, 7, 3, 4],
'D': [1, 2, 6, 5, 8, 3, 1, 7]},
index=[0, 1, 2, 3, 5, 6, 7, 8])
histo = df.hist(by=df['A'], sharey=True, sharex=True)
plt.ylabel('ylabel') # I assume the label is created on the 4th graph and then deleted?
plt.xlabel('xlabel') # Creates a label on the 4th graph.
plt.tight_layout()
plt.show()
The ouput looks like this.
Is there any way that I can create a Y Label that goes across the entire left side of the image (not each graph individually) and the same for the X Label.
As you can see, the x label only appears on the last graph created, and there is no y label.
Help?
This is one way to do it indirectly using the x- and y-labels as texts. I am not aware of a direct way using plt.xlabel or plt.ylabel. When passing an axis object to df.hist, the sharex and sharey arguments have to be passed in plt.subplots(). Here you can manually control/specify the position where you want to put the labels. For example, if you think the x-label is too close to the ticks, you can use 0.5, -0.02, 'X-label' to shift it slightly below.
import matplotlib.pyplot as plt
import pandas as pd
f, ax = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
df = pd.DataFrame({'A': [1, 2, 1, 2, 3, 4, 3, 4],
'B': [1, 7, 2, 4, 1, 4, 8, 3],
'C': [1, 4, 8, 3, 1, 7, 3, 4],
'D': [1, 2, 6, 5, 8, 3, 1, 7]},
index=[0, 1, 2, 3, 5, 6, 7, 8])
histo = df.hist(by=df['A'], ax=ax)
f.text(0, 0.5, 'Y-label', ha='center', va='center', fontsize=20, rotation='vertical')
f.text(0.5, 0, 'X-label', ha='center', va='center', fontsize=20)
plt.tight_layout()
I fixed the issue with the variable number of sub-plots using something like this:
cols = 3
n = len(set(df['A']))
rows = int(n / cols) + (0 if n % cols == 0 else 1)
fig, axes = plt.subplots(rows, cols)
extra = rows * cols - n
if extra:
newaxes = []
count = 0
for row in range(rows):
for col in range(cols):
if count < n:
newaxes.append(axes[row][col])
else:
axes[row][col].axis('off')
count += 1
else:
newaxes = axes
hist = df.hist(by=df['A'], ax=newaxes)
I have the following data frame my_df:
my_1 my_2 my_3
--------------------------------
0 5 7 4
1 3 5 13
2 1 2 8
3 12 9 9
4 6 1 2
I want to make a plot where x-axis is categorical values with my_1, my_2, and my_3. y-axis is integer. For each column in my_df, I want to plot all its 5 values at x = my_i. What kind of plot should I use in matplotlib? Thanks!
You could make a bar chart:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
df.T.plot(kind='bar')
plt.show()
or a scatter plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
fig, ax = plt.subplots()
cols = np.arange(len(df.columns))
x = np.repeat(cols, len(df))
y = df.values.ravel(order='F')
color = np.tile(np.arange(len(df)), len(df.columns))
scatter = ax.scatter(x, y, s=150, c=color)
ax.set_xticks(cols)
ax.set_xticklabels(df.columns)
cbar = plt.colorbar(scatter)
cbar.set_ticks(np.arange(len(df)))
plt.show()
Just for fun, here is how to make the same scatter plot using Pandas' df.plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
columns = df.columns
index = df.index
df = df.stack()
df.index.names = ['color', 'column']
df = df.rename('y').reset_index()
df['x'] = pd.Categorical(df['column']).codes
ax = df.plot(kind='scatter', x='x', y='y', c='color', colorbar=True,
cmap='viridis', s=150)
ax.set_xticks(np.arange(len(columns)))
ax.set_xticklabels(columns)
cbar = ax.collections[-1].colorbar
cbar.set_ticks(index)
plt.show()
Unfortunately, it requires quite a bit of DataFrame manipulation just to call
df.plot and then there are some extra matplotlib calls needed to set the tick
marks on the scatter plot and colorbar. Since Pandas is not saving effort here,
I would go with the first (NumPy/matplotlib) approach shown above.