Float format in matplotlib table - python-3.x

Given a pandas dataframe, I am trying to translate it into a table by using this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {"Name": ["John", "Leonardo", "Chris", "Linda"],
"Location" : ["New York", "Florence", "Athens", "London"],
"Age" : [41, 33, 53, 22],
"Km": [1023,2312,1852,1345]}
df = pd.DataFrame(data)
fig, ax = plt.subplots()
ax.axis('off')
ax.set_title("Table", fontsize=16, weight='bold')
table = ax.table(cellText=df.values,
bbox=[0, 0, 1.5, 1],
cellLoc='center',
colLabels=df.columns)
And it works. However, I can't figure out how to set the format for numbers as {:,.2f}, that is, with commas as thousands separators and two decimals.
Any suggestion?

Insert the following two lines of code after df is created and the rest of your code works as desired.
The Age and Km columns are defined as type int; convert these to float before using your str.format:
df.update(df[['Age', 'Km']].astype(float))
Now use DataFrame.applymap(str.format) on these two columns (since pandas 2.1, applymap is deprecated in favor of DataFrame.map):
df.update(df[['Age', 'Km']].applymap('{:,.2f}'.format))
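As a sketch of the same idea that leaves df untouched, the formatted strings can be built in a copy and passed to ax.table as cellText. The name cell_text is illustrative, and Series.map is used here instead of applymap; that is a choice made for this example, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Leonardo", "Chris", "Linda"],
    "Location": ["New York", "Florence", "Athens", "London"],
    "Age": [41, 33, 53, 22],
    "Km": [1023, 2312, 1852, 1345],
})

# Work on a copy so the numeric columns of df stay numeric.
cell_text = df.copy()
for col in ["Age", "Km"]:
    # Convert to float, then render each value with a thousands
    # separator and two decimals.
    cell_text[col] = df[col].astype(float).map("{:,.2f}".format)

print(cell_text)
```

cell_text.values can then replace df.values in the ax.table call, while df keeps its integer columns for any later calculation.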

Related

Pandas Series of dates to vlines kwarg in mplfinance plot

import numpy as np
import pandas as pd
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],
'w': [5, 7],
'n': [11, 8]})
df.reset_index()
print(list(df.loc[:,'dt'].values))
gives: ['2021-2-13', '2022-2-15']
NEEDED: [('2021-2-13'), ('2022-2-15')]
Important (in reply to a question in the comments): "NEEDED" is the form in which mplfinance accepts the vlines argument for plot (checked). I need to draw vertical lines at specified dates on the x-axis of the chart.
import mplfinance as mpf
RES['Date'] = RES['Date'].dt.strftime('%Y-%m-%d')
my_vlines=RES.loc[:,'Date'].values # does NOT work
fig, axlist = mpf.plot( ohlc_df, type="candle", vlines= my_vlines, xrotation=30, returnfig=True, figsize=(6,4))
This only works with an explicit my_vlines= [('2022-01-18'), ('2022-02-25')]
SOLVED: Oh, it really appears to be so simple after all
my_vlines=list(RES.loc[:,'Date'].values)
Your question asks for a list of Numpy arrays but your desired output looks like Tuples. If you need Tuples, note that it's the comma that makes the tuple not the parentheses, so you'd do something like this:
desired_format = [(x,) for x in list(df.loc[:,'dt'].values)]
If you want numpy arrays, you could do this
desired_format = [np.array(x) for x in list(df.loc[:,'dt'].values)]
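The point about the comma can be checked in plain Python. This is a small sketch with illustrative variable names; it shows that the "NEEDED" form from the question is just a list of strings, which is why list(...) turned out to be enough.

```python
# Parentheses alone only group an expression: this is still a str.
with_parens = ('2021-2-13')
# The trailing comma is what creates a one-element tuple.
with_comma = ('2021-2-13',)

dates = ['2021-2-13', '2022-2-15']
# The "NEEDED" form from the question, written out literally.
needed_form = [('2021-2-13'), ('2022-2-15')]
# A list of one-element tuples, if tuples really are required.
as_tuples = [(d,) for d in dates]

print(type(with_parens).__name__, type(with_comma).__name__)
print(needed_form == dates)
```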
I think I understand your problem. Please see the example code below and let me know if this resolves your problem. I expanded on your dataframe to meet mplfinance plot criteria.
import pandas as pd
import numpy as np
import mplfinance as mpf
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],'Open': [5,7],'Close': [11, 8],'High': [21,30],'Low': [7, 3]})
df['dt']=pd.to_datetime(df['dt'])
df.set_index('dt', inplace = True)
mpf.plot(df, vlines = dict(vlines = df.index.tolist()))

Matplotlib: applying cellColours to only certain columns/cells

Got myself in a pickle.
I'm creating a basic table in Matplotlib (via Pandas, but that's not the issue). What I'm trying to accomplish is a table where the first column, which will hold string values, remains white, while columns 2, 3, 4, 5 and 6 hold floats/integers and are colored by a custom normalized colormap.
I've started with the basics and created the 'colored' table via the code below. At this point it only plots the columns with numeric values.
What I ultimately need to do is plot this with an additional column, say before column 'A' or after column 'F', which holds string values, e.g. ['MBIAS', 'RMSE', 'BAGSS', 'MBIAS', 'MBIAS'].
However if I try to apply the cellColours method in the code below to a table that mixes lists of strings and float/integers, it obviously fails.
Is there a method to apply a cellColours scheme to only certain cells, or row/columns? Can I loop through, applying the custom colormap to specific cells?
Any help or tips would be appreciated!
Code:
import numpy as np
import matplotlib
from matplotlib import cm
import matplotlib.pyplot as plt
from pandas import *
#Create sample data in pandas dataframe
idx = Index(np.arange(1,6))
df = DataFrame(abs(2*np.random.randn(5, 5)), index=idx, columns=['A', 'B', 'C', 'D', 'E'])
model = ['conusarw', 'conusarw', 'conusarw', 'nam04', 'emhrrr']
df['Model'] = model
df1 = df[['A','B','C','D','E']]
test = df1.round({'A':2,'B':2,'C':2,'D':2,'E':2})
print(test)
vals = test.values
print(vals)
#Creates a normalization (to the range 0-1) based on a user-provided range and center of distribution.
norm = matplotlib.colors.TwoSlopeNorm(vmin=0,vcenter=1,vmax=10)
#Merges colormap to the normalized data based on customized normalization pattern from above.
colours = plt.cm.coolwarm(norm(vals))
#Create figure in Matplotlib in which to plot table.
fig = plt.figure(figsize=(15,8))
ax = fig.add_subplot(111, frameon=False, xticks=[], yticks=[])
#Plot table, using pandas dataframe information and data.
#Customized lists of data and names can also be provided.
the_table=plt.table(cellText=vals, rowLabels=model, colLabels=df1.columns,
loc='center', cellColours=colours)
plt.savefig('test_table.png')
Instead of the fast vectorized call colours = plt.cm.coolwarm(norm(vals)), you can just use regular Python loops with if-tests. The code below loops through the individual rows, then through the individual elements, and tests whether each one is numeric. A similar loop prepares the rounded values. Speed is not really a problem unless you have thousands of elements.
(The code uses import pandas as pd, as from pandas import * isn't recommended.)
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba, TwoSlopeNorm
import pandas as pd
import numpy as np
# Create sample data in pandas dataframe
idx = pd.Index(np.arange(1, 6))
df = pd.DataFrame(abs(2 * np.random.randn(5, 5)), index=idx, columns=['A', 'B', 'C', 'D', 'E'])
df['Model'] = ['conusarw', 'conusarw', 'conusarw', 'nam04', 'emhrrr']
cmap = plt.cm.coolwarm
norm = TwoSlopeNorm(vmin=0, vcenter=1, vmax=10)
colours = [['white' if not np.issubdtype(type(val), np.number) else cmap(norm(val)) for val in row]
for row in df.values]
vals = [[val if not np.issubdtype(type(val), np.number) else np.round(val, 2) for val in row]
for row in df.values]
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, frameon=False, xticks=[], yticks=[])
the_table = plt.table(cellText=vals, rowLabels=df['Model'].to_list(), colLabels=df.columns,
loc='center', cellColours=colours)
plt.show()
PS: If speed is a concern, the following code is a bit trickier. It uses:
setting the "bad color" of a colormap
pd.to_numeric(..., errors='coerce') to convert all strings to nans
as pd.to_numeric() only works for 1D arrays, ravel() and reshape() are used
using the same arrays, np.where can do the rounding
cmap = plt.cm.coolwarm.copy()
cmap.set_bad('white')
norm = TwoSlopeNorm(vmin=0, vcenter=1, vmax=10)
values = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(df.shape)
colours = cmap(norm(values))
vals = np.where(np.isnan(values), df.values, np.round(values, 2))
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, frameon=False, xticks=[], yticks=[])
the_table = plt.table(cellText=vals, rowLabels=df['Model'].to_list(), colLabels=df.columns,
loc='center', cellColours=colours)
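The coercion step of the vectorized version can be checked on a small array without drawing anything. In this sketch the array raw and the names is_text and cell_text are illustrative stand-ins for df.values and vals above.

```python
import numpy as np
import pandas as pd

# A tiny mixed array standing in for df.values (numbers and strings).
raw = np.array([[1.234, "conusarw"], [5.678, "nam04"]], dtype=object)

# pd.to_numeric only accepts 1D input, so flatten first and restore
# the shape afterwards; strings become NaN with errors='coerce'.
values = pd.to_numeric(raw.ravel(), errors="coerce").reshape(raw.shape)

# NaN marks the cells that held text; keep the original text there
# and use the rounded number everywhere else.
is_text = np.isnan(values)
cell_text = np.where(is_text, raw, np.round(values, 2))

print(is_text)
print(cell_text)
```

The same NaN positions are the ones that the colormap paints with its "bad" color once set_bad('white') has been called.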

How to annotate with multiple columns using mplcursors

Below is code for a scatter plot annotated using mplcursors which uses two columns, labeling the points by a third column.
How can two values from two columns from a single dataframe be selected for annotation text in a single text box?
Instead of only "name", I would like both "height" and "name" to show in the annotation text box. Using df[['height', 'name']] does not work.
How can this be achieved otherwise?
import matplotlib.pyplot as plt
import mplcursors
import pandas as pd
df = pd.DataFrame(
[("Alice", 163, 54),
("Bob", 174, 67),
("Charlie", 177, 73),
("Diane", 168, 57)],
columns=["name", "height", "weight"])
df.plot.scatter("height", "weight")
mplcursors.cursor(multiple = True).connect("add", lambda sel: sel.annotation.set_text((df["name"])[sel.target.index]))
plt.show()
df.loc[sel.target.index, ["name", 'height']].to_string(): correctly select the columns and row with .loc and then create a string with .to_string()
Tested in python 3.8, matplotlib 3.4.2, pandas 1.3.1, and jupyterlab 3.1.4
In mplcursors v0.5.1, Selection.target.index is deprecated, use Selection.index instead.
df.iloc[x.index, :] instead of df.iloc[x.target.index, :]
from mplcursors import cursor
import matplotlib.pyplot as plt
import pandas as pd
# for interactive plots in Jupyter Lab, use the following magic command, otherwise comment it out
%matplotlib qt
df = pd.DataFrame([('Alice', 163, 54), ('Bob', 174, 67), ('Charlie', 177, 73), ('Diane', 168, 57)], columns=["name", "height", "weight"])
ax = df.plot(kind='scatter', x="height", y="weight", c='tab:blue')
cr = cursor(ax, hover=True, multiple=True)
cr.connect("add", lambda sel: sel.annotation.set_text((df.loc[sel.index, ["name", 'height']].to_string())))
plt.show()
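To see what text the annotation receives without opening a plot window, the string can be built separately. This is a sketch: the helper names annotation_text and custom_text are illustrative and are not part of mplcursors.

```python
import pandas as pd

df = pd.DataFrame(
    [("Alice", 163, 54), ("Bob", 174, 67), ("Charlie", 177, 73), ("Diane", 168, 57)],
    columns=["name", "height", "weight"],
)

def annotation_text(index):
    """Select one row and two columns with .loc, then render them as text."""
    return df.loc[index, ["name", "height"]].to_string()

def custom_text(index):
    """Alternative layout built with an f-string instead of to_string()."""
    row = df.loc[index]
    return f"{row['name']}\nheight: {row['height']}"

text = annotation_text(1)
print(text)
print(custom_text(0))
```

Either helper could then be used in the callback, for example lambda sel: sel.annotation.set_text(annotation_text(sel.index)).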

Iterating over columns from two dataframes to estimate correlation and p-value

I am trying to estimate Pearson's correlation coefficient and p-value from the corresponding columns of two dataframes. I managed to write the code below, but it only gives me the results for the last columns. I need some help with this code. I would also like to save the outputs in a new dataframe.
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame(pd.read_excel('15_Oct_Yield_A.xlsx'))
df_2= pd.DataFrame(pd.read_excel('Oct_Z_index.xlsx'))
for column in df_1.columns[1:]:
    for column in df_2.columns[1:]:
        x = (df_1[column])
        y = (df_2[column])
    correl = stats.pearsonr(x, y)
Your looping setup is incorrect on a couple measures... You are using the same variable name in both for-loops which is going to cause problems. Also, you are computing correl outside of your inner loop... etc.
What you want to do is loop over the columns with 1 loop, assuming that both data frames have the same column names. If they do not, you will need to take extra steps to find the common column names and then iterate over them.
Something like this should work:
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame({ 'A': ['dog', 'pig', 'cat'],
'B': [0.25, 0.50, 0.75],
'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({ 'A': ['bird', 'monkey', 'rat'],
'B': [0.20, 0.60, 0.90],
'C': [0.80, 0.50, 0.10]})
results = dict()
for column in df_1.columns[1:]:
    correl = stats.pearsonr(df_1[column], df_2[column])
    results[column] = correl
print(results)
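For the case where the two data frames do not share all column names, and to get the output as a new dataframe as the question asks, one possible sketch follows. The extra column D and the output column names r and p_value are made up for illustration.

```python
import pandas as pd
from scipy import stats

df_1 = pd.DataFrame({"A": ["dog", "pig", "cat"],
                     "B": [0.25, 0.50, 0.75],
                     "C": [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({"A": ["bird", "monkey", "rat"],
                     "B": [0.20, 0.60, 0.90],
                     "C": [0.80, 0.50, 0.10],
                     "D": [1.0, 2.0, 3.0]})  # present only in df_2

# Keep only numeric columns, then only the names found in both frames.
numeric_1 = df_1.select_dtypes(include="number").columns
numeric_2 = df_2.select_dtypes(include="number").columns
common = numeric_1.intersection(numeric_2)

rows = {}
for column in common:
    # pearsonr returns the correlation coefficient and the p-value.
    r, p = stats.pearsonr(df_1[column], df_2[column])
    rows[column] = {"r": r, "p_value": p}

# One row per column name, one column per statistic.
results = pd.DataFrame.from_dict(rows, orient="index")
print(results)
```

Selecting by dtype replaces the columns[1:] slice, so the position of the text column no longer matters.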

Converting string to date in numpy unpack

I'm learning how to extract data from links and then proceeding to graph them.
For this tutorial, I was using the yahoo dataset of a stock.
The code is as follows
import matplotlib.pyplot as plt
import numpy as np
import urllib.request
import matplotlib.dates as mdates
import datetime
def bytespdate2num(fmt, encoding='utf-8'):
    strconverter = mdates.strpdate2num(fmt)
    def bytesconverter(b):
        s = b.decode(encoding)
        return strconverter(s)
    return bytesconverter
def graph_data(stock):
    stock_price_url = 'https://pythonprogramming.net/yahoo_finance_replacement'
    source_code = urllib.request.urlopen(stock_price_url).read().decode()
    stock_data = []
    split_source=source_code.split('\n')
    print(len(split_source))
    for line in split_source:
        split_line=line.split(',')
        if (len(split_line)==7):
            stock_data.append(line)
    date,openn,closep,highp,lowp,openp,volume=np.loadtxt(stock_data,delimiter=',',unpack=True,converters={0:bytespdate2num('%Y-%m-%d')})
    plt.plot_date(date,closep)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Graph')
    plt.show()
graph_data('TSLA')
The whole code is pretty easy to understand except the part that converts the string datatype into date format using the bytespdate2num function.
Is there an easier way to convert strings extracted from reading a URL into date format during numpy extraction, or is there another method I can use?
Thank you
With a guess as to the csv format, I can use the numpy 'native' datetime dtype:
In [183]: txt = ['2020-10-23 1 2.3']*3
In [184]: txt
Out[184]: ['2020-10-23 1 2.3', '2020-10-23 1 2.3', '2020-10-23 1 2.3']
If I let genfromtxt do its own dtype conversions:
In [187]: np.genfromtxt(txt, dtype=None, encoding=None)
Out[187]:
array([('2020-10-23', 1, 2.3), ('2020-10-23', 1, 2.3),
('2020-10-23', 1, 2.3)],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
the date column is rendered as a string.
If I specify a datetime64 format:
In [188]: np.array('2020-10-23', dtype='datetime64[D]')
Out[188]: array('2020-10-23', dtype='datetime64[D]')
In [189]: np.genfromtxt(txt, dtype=['datetime64[D]',int,float], encoding=None)
Out[189]:
array([('2020-10-23', 1, 2.3), ('2020-10-23', 1, 2.3),
('2020-10-23', 1, 2.3)],
dtype=[('f0', '<M8[D]'), ('f1', '<i8'), ('f2', '<f8')])
This date appears to work in plt
In [190]: plt.plot_date(_['f0'], _['f1'])
I used genfromtxt because I'm more familiar with its ability to handle dtypes.
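If the lines have already been split by hand, as in the question, the same datetime64[D] conversion can be applied to the date strings directly, without a converter function. This is a minimal sketch; the sample rows are made up and only stand in for the downloaded CSV lines.

```python
import numpy as np

# Hypothetical "date,close" rows standing in for the downloaded data.
lines = ["2020-10-21,10.5", "2020-10-22,11.0", "2020-10-23,10.8"]

rows = [line.split(",") for line in lines]
# ISO-formatted date strings convert directly to NumPy's native date dtype.
dates = np.array([row[0] for row in rows], dtype="datetime64[D]")
closes = np.array([float(row[1]) for row in rows])

# Differences between dates come out as timedelta64 values.
day_steps = np.diff(dates)
print(dates.dtype, day_steps)
```

Arrays like dates and closes can then be handed to Matplotlib, for example with plt.plot(dates, closes).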
