Plotting Pandas DF with Numpy Arrays - python-3.x

I have a Pandas df with multiple columns, and each cell holds a Numpy array with a varying number of elements. I would like to plot all the elements of the array for every cell within a column.
I have tried
plt.plot(df['column'])
plt.plot(df['column'][0:])
Both give a ValueError: setting an array element with a sequence
It is very important that these values get plotted at their corresponding index, as the index represents linear time in this dataframe. I would really appreciate it if someone showed me how to do this properly. Perhaps there is a package other than matplotlib.pyplot that is better suited for this?
Thank you

plt.plot needs a list of x-coordinates together with an equally long list of y-coordinates. As you seem to want to use the index of the dataframe for the x-coordinate and each cell contents for the y-coordinates, you need to repeat the x-values as many times as the length of the y-coordinates.
Note that this format doesn't suit a line plot, as connecting subsequent points would create some strange vertical lines. plt.plot accepts a marker as its third parameter, for example '.' to draw a simple dot at each position.
A code example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 30
df = pd.DataFrame({f'column{c}':
[np.random.normal(np.random.uniform(10, 100), 1, np.random.randint(3, 11)) for _ in range(N)]
for c in range(1, 6)})
legend_handles = []
colors = plt.cm.Set1.colors
desired_columns = df.columns
for column, color in zip(desired_columns, colors):
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for ind, cell in df[column].items():
        if len(cell) > 0:
            plotted, = plt.plot([ind] * len(cell), cell, '.', color=color)
    legend_handles.append(plotted)
plt.legend(legend_handles, desired_columns)
plt.show()
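The repetition of x-values described above can also be done once per column instead of once per cell. The following is a minimal sketch, assuming every cell holds a 1D numpy array; flatten_column is a hypothetical helper name, not part of pandas or matplotlib.

```python
import numpy as np
import pandas as pd


def flatten_column(series):
    """Return (x, y): each index label repeated once per element of its cell."""
    # Number of elements stored in each cell
    lengths = series.map(len).to_numpy()
    # Repeat every index label as many times as its cell has elements
    x = np.repeat(series.index.to_numpy(), lengths)
    # Join all cell arrays into one flat array of y-values
    y = np.concatenate(series.to_list())
    return x, y


# Small hand-made column; the empty array contributes no points
column = pd.Series([np.array([1.0, 2.0]), np.array([]), np.array([3.0])],
                   index=[10, 20, 30])
x, y = flatten_column(column)
print(x, y)
```

The two arrays have equal length, so they could be handed to a single call such as plt.plot(x, y, '.') for each column.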
Note that pandas really isn't meant to store complete arrays inside cells. The preferred way is to create a dataframe in "long" form, with each value in a separate row (with the "index" repeated). Most functions of pandas and seaborn don't understand about arrays inside cells.
Here's a way to create a long form which can be called using Seaborn:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
N = 30
df = pd.DataFrame({f'column{c}':
[np.random.normal(np.random.uniform(10, 100), 1, np.random.randint(3, 11)) for _ in range(N)]
for c in range(1, 6)})
desired_columns = df.columns
df_long_data = []
for column in desired_columns:
    for ind, cell in df[column].items():
        for val in cell:
            # Named "row" to avoid shadowing the built-in dict
            row = {'timestamp': ind, 'column_name': column, 'value': val}
            df_long_data.append(row)
df_long = pd.DataFrame(df_long_data)
sns.scatterplot(x='timestamp', y='value', hue='column_name', data=df_long)
plt.show()
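As an alternative to the explicit loops, the long form might also be built with melt and explode. This is only a sketch on a tiny hand-made dataframe; the column names timestamp, column_name and value follow the example above, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the dataframe with arrays inside its cells
df = pd.DataFrame({
    'column1': [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])],
    'column2': [np.array([6.0]), np.array([7.0, 8.0])],
})

df_long = (
    df.rename_axis('timestamp')      # name the index so it survives reset_index
      .reset_index()
      .melt(id_vars='timestamp', var_name='column_name', value_name='value')
      .explode('value', ignore_index=True)  # one row per array element
)
# explode leaves an object column; convert it back to numbers
df_long['value'] = df_long['value'].astype(float)
print(df_long)
```

An empty array in a cell would become a single NaN row after explode, so such rows may need to be dropped with dropna() before plotting.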

As per your problem, you have numpy arrays in each cell which you want to plot. The plot() method accepts a single numpy array, but when you pass a whole column, as you did, you are actually passing a sequence of arrays, which is what raises the error. So you might need to pass every cell to plt.plot() individually.
This might help:
for column in df.columns:
    for cell in df[column]:
        plt.plot(cell)
plt.show()

Related

Matplotlib: applying cellColours to only certain columns/cells

Got myself in a pickle.
I'm creating a basic table in Matplotlib (via Pandas, but that's not the issue). What I'm trying to accomplish is to create a table where the first column, which will be string values, remains white...but columns 2,3,4,5,6 are floating/integers and will be colored by a custom normalized colormap.
I've started with the basics, and created the 'colored' table via the code below. This only plots the columns with integer values at this point.
What I ultimately need to do is plot this with an additional column, say before column 'A' or after column 'F', which holds string values, e.g. ['MBIAS', 'RMSE', 'BAGSS', 'MBIAS', 'MBIAS'].
However if I try to apply the cellColours method in the code below to a table that mixes lists of strings and float/integers, it obviously fails.
Is there a method to apply a cellColours scheme to only certain cells, or row/columns? Can I loop through, applying the custom colormap to specific cells?
Any help or tips would be appreciated!
Code:
import numpy as np
import matplotlib
from matplotlib import cm
import matplotlib.pyplot as plt
from pandas import *
#Create sample data in pandas dataframe
idx = Index(np.arange(1,6))
df = DataFrame(abs(2*np.random.randn(5, 5)), index=idx, columns=['A', 'B', 'C', 'D', 'E'])
model = ['conusarw', 'conusarw', 'conusarw', 'nam04', 'emhrrr']
df['Model'] = model
df1 = df[['A','B','C','D','E']]
test = df1.round({'A':2,'B':2,'C':2,'D':2,'E':2})
print(test)
vals = test.values
print(vals)
#Creates normalized list (from 0-1) based a user provided range and center of distribution.
norm = matplotlib.colors.TwoSlopeNorm(vmin=0,vcenter=1,vmax=10)
#Merges colormap to the normalized data based on customized normalization pattern from above.
colours = plt.cm.coolwarm(norm(vals))
#Create figure in Matplotlib in which to plot table.
fig = plt.figure(figsize=(15,8))
ax = fig.add_subplot(111, frameon=False, xticks=[], yticks=[])
#Plot table, using pandas dataframe information and data.
#Customized lists of data and names can also be provided.
the_table=plt.table(cellText=vals, rowLabels=model, colLabels=df.columns,
loc='center', cellColours=colours)
plt.savefig('test_table.png')
Instead of the fast vectorized call colours = plt.cm.coolwarm(norm(vals)), you can just use regular Python loops with if-tests. The code below loops through the individual rows, then through the individual elements, and tests whether they are numeric. A similar loop prepares the rounded values. Speed is not really a problem, unless you have thousands of elements.
(The code uses import pandas as pd, as from pandas import * isn't recommended.)
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba, TwoSlopeNorm
import pandas as pd
import numpy as np
# Create sample data in pandas dataframe
idx = pd.Index(np.arange(1, 6))
df = pd.DataFrame(abs(2 * np.random.randn(5, 5)), index=idx, columns=['A', 'B', 'C', 'D', 'E'])
df['Model'] = ['conusarw', 'conusarw', 'conusarw', 'nam04', 'emhrrr']
cmap = plt.cm.coolwarm
norm = TwoSlopeNorm(vmin=0, vcenter=1, vmax=10)
colours = [['white' if not np.issubdtype(type(val), np.number) else cmap(norm(val)) for val in row]
for row in df.values]
vals = [[val if not np.issubdtype(type(val), np.number) else np.round(val, 2) for val in row]
for row in df.values]
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, frameon=False, xticks=[], yticks=[])
the_table = plt.table(cellText=vals, rowLabels=df['Model'].to_list(), colLabels=df.columns,
loc='center', cellColours=colours)
plt.show()
PS: If speed is a concern, the following code is a bit trickier. It uses:
- setting the "bad color" of a colormap
- pd.to_numeric(..., errors='coerce') to convert all strings to NaNs
- ravel() and reshape(), as pd.to_numeric() only works for 1D arrays
- np.where on the same arrays to do the rounding
cmap = plt.cm.coolwarm.copy()
cmap.set_bad('white')
norm = TwoSlopeNorm(vmin=0, vcenter=1, vmax=10)
values = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(df.shape)
colours = cmap(norm(values))
vals = np.where(np.isnan(values), df.values, np.round(values, 2))
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, frameon=False, xticks=[], yticks=[])
the_table = plt.table(cellText=vals, rowLabels=df['Model'].to_list(), colLabels=df.columns,
loc='center', cellColours=colours)
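The "bad color" mechanism used above can be checked in isolation. The following is a minimal sketch with made-up values, in which NaN stands in for a cell that held a string:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# Copy the colormap so the globally registered one is left untouched
cmap = plt.cm.coolwarm.copy()
cmap.set_bad('white')  # colour used for NaN / masked entries
norm = TwoSlopeNorm(vmin=0, vcenter=1, vmax=10)

# NaN marks a cell that was not numeric
values = np.array([[0.5, np.nan],
                   [3.0, 1.0]])
colours = cmap(norm(values))  # RGBA array of shape (2, 2, 4)
print(colours[0, 1])          # RGBA of the NaN cell
```

Every NaN entry receives the bad colour, so string cells stay white while numeric cells are coloured by the normalized colormap.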

Break a pandas line plot at specific date

I have a time-series dataframe with missing data for some time period. I would like to create a line plot and break a line where there is missing data.
data_site1_ave[["samples", "lkt"]].plot(figsize=(15,4), title = "Site 1", xlabel='')
Is it possible to create a gap, let's say from 2018-05-01 to 2018-10-30 in the line plot?
Yes, you can create arbitrary gaps by simply calling df.plot() several times, on the appropriate slices of the full dataframe. To make everything appear in the same plot, pass the same Axes object to each df.plot() call via the ax keyword argument. Turn the legend off for all but one call, so that the legend only has the one entry.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# create sample time series
N = 365
np.random.seed(42)
x = pd.date_range('2018-01-01', freq='d', periods=N)
y = np.cumsum(np.random.rand(N, 1) - 0.5)
df = pd.DataFrame(y, columns=['y'], index=x)
# plot time series with gap
fig, ax = plt.subplots()
df.loc[:'2018-05-01'].plot(ax=ax, c='blue')
df.loc['2018-10-31':].plot(ax=ax, c='blue', legend=False);
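Another option, not part of the answer above, is to overwrite the unwanted period with NaN, since line plots leave gaps at NaN values. A minimal sketch on the same kind of synthetic series, with the gap dates taken from the question:

```python
import numpy as np
import pandas as pd

# Synthetic daily series, similar to the sample data above
N = 365
rng = np.random.default_rng(42)
x = pd.date_range('2018-01-01', freq='D', periods=N)
df = pd.DataFrame({'y': np.cumsum(rng.random(N) - 0.5)}, index=x)

# Blank out the period that should appear as a gap
df_gap = df.copy()
df_gap.loc['2018-05-02':'2018-10-30', 'y'] = np.nan

# A single plot call; the line is interrupted over the NaN stretch
ax = df_gap.plot(color='blue')
```

This keeps a single legend entry without extra arguments, at the cost of modifying a copy of the data.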

Pandas - comparing average of hour periods against each other for a given date range

I'm trying to get used to using datetime data in Pandas and plotting different comparisons for a given dataset. I'm using the London Air Quality dataset for Ozone to practice and am trying to replicate the chart below (that I've created using a pivot table in Excel) with Pandas and matplotlib.
The chart plots an average of each hours Ozone reading for each location across the entire dataset to see if there is one location which is constantly higher than others or if different locations have the highest Ozone levels at different periods throughout the day.
Essentially, I'm looking to plot the hourly average of Ozone for each location.
I've attempted to reshape the data into a multiindex format and then plot, similar to what I'd do in excel before plotting but am unsure if this is the correct way to approach the problem. Code for reshaping is below. I am still getting used to reshaping so not sure if this is the correct use/I am approaching the problem in the correct way and open to other methods to accomplish this task. Any assistance to accomplish this task would be much appreciated!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
data = pd.read_csv('/Users/xx/Downloads/LaqnData.csv')
data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])
data['Date'] = pd.to_datetime(data['ReadingDateTime']).dt.date
data['Time'] = pd.to_datetime(data['ReadingDateTime']).dt.time
data.set_index(['Date', 'Time'], inplace = True)
hourly_dataframe = data.pivot_table(columns = 'Site', values = 'Value', index = ['Date', 'Time'])
hourly_dataframe.fillna(method = 'ffill', inplace = True)
hourly_dataframe[hourly_dataframe < 0] = 0
I have gone to the site and downloaded a 24 hour reading for the following;
data.Site.unique()
array(['BX1', 'TH4', 'BT4', 'HI0', 'BL0', 'RD0'], dtype=object)
I adapted your code to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
data = pd.read_csv('/Users/xx/Downloads/LaqnData.csv')
data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])
I then set ReadingDateTime as the index and use the datetime index to call each hour in the groupby function.
data.set_index('ReadingDateTime', inplace=True)
data.groupby([data.index.hour, data['Site']])['Value'].mean().reset_index()  # Convert to dataframe.
To plot, I chain unstack to the groupby function and plot directly.
data.groupby([data.index.hour, data['Site']])['Value'].mean().unstack().plot()
plt.xlabel('Hour of the day')
plt.ylabel('Ozone')
plt.title('Average Hourly comparison')
plt.legend()  # If you want the legend to appear in default location
If fussed about legend location, this post explains it very well. In your case:
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=6)
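To try the groupby and unstack steps without the downloaded CSV, here is a sketch on synthetic data. The column names ReadingDateTime, Site and Value follow the question; the values themselves are invented so that the hourly means are easy to verify.

```python
import pandas as pd

# Two days of hourly readings for two sites (synthetic values)
times = pd.date_range('2020-01-01', periods=48, freq='h')
frames = []
for site, offset in [('BX1', 0.0), ('TH4', 10.0)]:
    frames.append(pd.DataFrame({'ReadingDateTime': times,
                                'Site': site,
                                'Value': times.hour + offset}))
data = pd.concat(frames, ignore_index=True)

# Mean per hour of day and site, then one column per site
hour = data['ReadingDateTime'].dt.hour.rename('hour')
hourly = data.groupby([hour, 'Site'])['Value'].mean().unstack()
print(hourly.head())
```

Calling hourly.plot() would then draw one line per site against the hour of the day. This variant groups on the .dt.hour accessor of the column, so it does not require the datetime column to be the index.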

Set hue using a range of values in Seaborn stripplot

I am trying to set hue based on a range of values rather than unique values in seaborn stripplot. For example, different colors for different value ranges (1940-1950, 1950-1960 etc.).
sns.stripplot('Condition', 'IM', data=dd3, jitter=0.3, hue= dd3['Year Built'])
Thanks
Looks like you need to bin the data. Use pd.cut() in the manner below. Here the years are binned into 20-year intervals; you can choose your own step in np.arange() to adjust your ranges. Note that years falling outside the bin edges get NaN as their bin.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
x = np.random.randint(0,100,size=100)
y = np.random.randint(0,100, size=100)
year = np.random.randint(1918, 2019, size=100)
df = pd.DataFrame({
'x':x,
'y':y,
'year':year
})
df['year_bin'] = pd.cut(df['year'], np.arange(min(year), max(year), step=20))
# Recent seaborn versions require x and y to be passed as keywords
sns.lmplot(x='x', y='y', data=df, hue='year_bin')
plt.show()
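For decade ranges like those mentioned in the question (1940-1950, 1950-1960, ...), explicit edges and labels can be given to pd.cut(). A minimal sketch with a few made-up years:

```python
import numpy as np
import pandas as pd

years = pd.Series([1941, 1949, 1950, 1955, 1960, 1969])

# Bin edges every 10 years: 1940, 1950, 1960, 1970
edges = np.arange(1940, 1971, 10)
# One readable label per interval, e.g. '1940-1950'
labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]

# right=False makes each bin include its left edge and exclude its right edge
decade = pd.cut(years, bins=edges, labels=labels, right=False)
print(decade)
```

The resulting categorical column can be assigned to the dataframe and passed as hue, in the same way as year_bin above.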

How do I map df column values to hex color in one go?

I have a pandas dataframe with two columns. One of the columns values needs to be mapped to colors in hex. Another graphing process takes over from there.
This is what I have tried so far. Part of the toy code is taken from here.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df
Which outputs:
How do I convert 'some_value' df column values to hex in one go?
Ideally using the sns.cubehelix_palette(light=1)
I am not opposed to using something other than matplotlib
Thanks in advance.
You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
Efficiency
The above method is easy to use, but may not be very efficient. In the following, let's compare some alternatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def create_df(n=10):
    # Create dataframe
    df = pd.DataFrame(np.random.randint(0,21,size=(n, 2)),
                      columns=['some_value', 'another_value'])
    # Add a nan to handle realworld
    df.iloc[-1] = np.nan
    return df
The following is the solution from above. It applies the conversion to the dataframe row by row. This is quite inefficient.
def apply1(df):
    # map values to colors in hex via
    # matplotlib to_hex by pandas apply
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
    return df
That's why we might choose to calculate the values into a numpy array first and just assign this array as the newly created column.
def apply2(df):
    # map values to colors in hex via
    # matplotlib to_hex by assigning numpy array as column
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    a = mapper.to_rgba(df['some_value'])
    df['some_value_color'] = np.apply_along_axis(mcolors.to_hex, 1, a)
    return df
Finally we may use a look-up table (LUT) which is created from the matplotlib colormap, and index the LUT by the normalized data. Because this solution needs to create the LUT first, it is rather inefficient for dataframes with fewer entries than the LUT has colors, but will pay off for large dataframes.
def apply3(df):
    # map values to colors in hex via
    # creating a hex look-up table and applying the normalized data to it
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    lut = plt.cm.viridis(np.linspace(0,1,256))
    lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
    a = (norm(df['some_value'].values)*255).astype(np.int16)
    df['some_value_color'] = lut[a]
    return df
Compare the timings
Let's take a dataframe with 10000 rows.
df = create_df(10000)
Original solution (apply1)
%timeit apply1(df)
2.66 s per loop
Array solution (apply2)
%timeit apply2(df)
240 ms per loop
LUT solution (apply3)
%timeit apply3(df)
7.64 ms per loop
In this case the LUT solution is roughly 350 times faster than the original (2.66 s versus 7.64 ms).
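The LUT idea can also be written so that NaN entries are skipped explicitly instead of being cast to an integer index. The following is only a sketch; values_to_hex is a hypothetical helper name, and mapping NaN to None is an assumption about the desired behaviour.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors


def values_to_hex(values, cmap=plt.cm.viridis, n_colors=256):
    """Map numbers to hex colour strings via a look-up table; NaN maps to None."""
    values = np.asarray(values, dtype=float)
    # Build the hex look-up table once
    lut = np.array([mcolors.to_hex(c) for c in cmap(np.linspace(0, 1, n_colors))],
                   dtype=object)
    norm = mcolors.Normalize(vmin=np.nanmin(values), vmax=np.nanmax(values), clip=True)
    # Only valid (non-NaN) entries are turned into LUT indices
    valid = ~np.isnan(values)
    idx = np.round(np.asarray(norm(values[valid])) * (n_colors - 1)).astype(int)
    out = np.full(values.shape, None, dtype=object)
    out[valid] = lut[idx]
    return out


hex_colors = values_to_hex([0.0, 10.0, 20.0, np.nan])
print(hex_colors)
```

The result could be assigned directly as a column, for example df['some_value_color'] = values_to_hex(df['some_value']).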
