Create a line graph per bin in Python 3 - python-3.x

I have a dataframe called 'games':
Game_id Goals P_value
1 2 0.4
2 3 0.321
45 0 0.64
I need to split the P value into steps of 0.05, bin the rows by P value, and then create a line graph that shows the sum of goals per P value bin.
What I currently have:
games.set_index('p value', inplace=True)
games.sort_index()
np.cumsum(games['goals']).plot()
But I get this:
No matter what I tried, I couldn't group the P values and show the sum of goals per P value.
I also tried to use matplotlib.pyplot, but then I couldn't use the cumsum function.

If I understood you correctly, you want to have discrete steps in the p-value of width 0.05 and show the cumulative sum?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create some random example data
df = pd.DataFrame({
'goals': np.random.poisson(3, size=1000),
'p_value': np.random.uniform(0, 1, size=1000)
})
# define binning in p-value
bin_edges = np.arange(0, 1.025, 0.05)
bin_center = 0.5 * (bin_edges[:-1] + bin_edges[1:])
bin_width = np.diff(bin_edges)
# find the p_value bin, each row belongs to
# 0 is underflow, len(edges) is overflow bin
df['bin'] = np.digitize(df['p_value'], bins=bin_edges)
# get the number of goals per p_value bin
goals_per_bin = df.groupby('bin')['goals'].sum()
print(goals_per_bin)
# not every bin might be filled, so we rely on pandas index
# matching to line the sums up with the bins
binned = pd.DataFrame({
'center': bin_center,
'width': bin_width,
'goals': np.zeros(len(bin_center))
}, index=np.arange(1, len(bin_edges)))
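# align on the bin index; bins without any rows get 0 instead of NaN
binned['goals'] = goals_per_bin.reindex(binned.index, fill_value=0)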
plt.step(
binned['center'],
binned['goals'],
where='mid',
)
plt.xlabel('p-value')
plt.ylabel('goals')
plt.show()
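The answer above plots the sum per bin. If the cumulative sum from the question is what is wanted, one possible sketch uses pd.cut for the binning and cumsum on the binned totals. The games frame below is rebuilt from the three rows shown in the question, with column names taken from its table header; p_bin is a name chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# the three rows shown in the question
games = pd.DataFrame({
    "Game_id": [1, 2, 45],
    "Goals": [2, 3, 0],
    "P_value": [0.4, 0.321, 0.64],
})

# 21 edges give 20 bins of width 0.05 covering [0, 1]
bin_edges = np.linspace(0, 1, 21)

# pd.cut labels every row with its interval (right-closed by default);
# "p_bin" is an illustrative column name, not part of the original data
games["p_bin"] = pd.cut(games["P_value"], bins=bin_edges, include_lowest=True)

# observed=False keeps empty bins, so they show up with a sum of 0
goals_per_bin = games.groupby("p_bin", observed=False)["Goals"].sum()

# running total of goals across the ordered bins
cumulative_goals = goals_per_bin.cumsum()
print(cumulative_goals)
```

Calling cumulative_goals.plot() afterwards should draw the running total as a line, with one point per bin.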

Related

Bin and count data to normalize Y axis

I have this area and diameter data:
DIAMETER AREA
0 3.039085 1230000
1 2.763617 1230000
2 2.052176 1230000
3 9.498093 1230000
4 2.680360 1230000
I want to bin the data by 1 (2-3,3-4 etc.) and count the number of diameters that are in those bins so that it's organized something like this:
2-3 3-4 4-5 5-6 6-7 7-8 8-9 9-10
3 1 0 0 0 0 0 1
My end goal is to then grab these counts and divide them by the area in order to normalize the counts.
Lastly I will plot the normalized counts (y) by the bins (x).
I tried a method using pd.cut, but it didn't work.
What you want here is a histogram. This can easily be achieved with the hist method of your pandas DataFrame (which itself just uses the hist plot method from Matplotlib, which in turn uses NumPy's histogram function), for example:
import pandas as pd
from matplotlib import pyplot as plt
# create a version of the data for the example
data = pd.DataFrame({"DIAMETER": [3.039085, 2.763617, 2.052176, 9.498093, 2.680360]})
fig, ax = plt.subplots() # create a figure to plot onto
bins = [2, 3, 4, 5, 6, 7, 8, 9, 10] # set the histogram bin edges
data.hist(
column="DIAMETER", # the column to histogram
bins=bins, # set the bin edges
density=True, # this normalises the histogram
ax=ax, # the Matplotlib axis onto which to plot
)
fig.show() # show the plot
This gives a histogram of the diameters (image not included here), where the pandas hist function automatically adds a plot title based on the column name.
If you don't specify the bins keyword, then it will automatically generate 10 bins bounded by the range of your data, but these will not necessarily be integer spaced. If you wanted to ensure integer spaced bins for arbitrary data, then you could set the bins with:
import numpy as np
bins = np.arange(
np.floor(data["DIAMETER"].min()),
np.ceil(data["DIAMETER"].max()) + 1,
)
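If the raw counts per bin divided by the area are needed, rather than a probability density, a sketch with pd.cut could look like the following. It assumes the AREA value is the same for every row, as in the sample data from the question; the label strings and variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# sample rows from the question
data = pd.DataFrame({
    "DIAMETER": [3.039085, 2.763617, 2.052176, 9.498093, 2.680360],
    "AREA": [1230000] * 5,
})

# integer bin edges 2, 3, ..., 10 and labels such as "2-3"
edges = np.arange(2, 11)
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False makes the bins left-closed: [2, 3), [3, 4), ...
binned = pd.cut(data["DIAMETER"], bins=edges, labels=labels, right=False)

# sort=False keeps the bins in order and includes the empty ones
counts = binned.value_counts(sort=False)

# assumption: one shared area, so the first AREA value is used for all bins
normalized = counts / data["AREA"].iloc[0]
print(counts)
print(normalized)
```

The normalized Series can then be plotted against the bin labels, for example with normalized.plot(kind="bar").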
If you want a plot like suggested in your other post, then purely with NumPy and Matplotlib you could do:
import numpy as np
from matplotlib import pyplot as plt
# set bins
bins = np.arange(
np.floor(data["DIAMETER"].min()) - 1, # subtract 1, so we have a zero count bin at the start
np.ceil(data["DIAMETER"].max()) + 2, # add 2, so we have a zero count bin at the end
)
# generate histogram
counts, _ = np.histogram(
data["DIAMETER"],
bins=bins,
density=True, # normalise histogram
)
dx_2 = 0.5 * (bins[1] - bins[0]) # bin half spacing (we know this is 1/2, but let's calculate it in case you change it!)
# plot
fig, ax = plt.subplots()
ax.plot(
bins[:-1] + dx_2,
counts,
marker="s", # square marker like the plot from your other question
)
ax.set_xlabel("Diameter")
ax.set_ylabel("Probability density")
fig.show()
which gives:

How to plot columns from a dataframe, as subplots

What am I doing wrong here? I want to create four new dataframes from df and use Dates as the x-axis in a line chart for each newly created dataframe (Eminis, FTSE, Stoxx and Nikkei).
I have a dataframe called df that I created from data.xlsx and it looks like this:
Dates ES1 Z 1 VG1 NK1
0 2005-01-04 -0.0126 0.0077 -0.0030 0.0052
1 2005-01-05 -0.0065 -0.0057 0.0007 -0.0095
2 2005-01-06 0.0042 0.0017 0.0051 0.0044
3 2005-01-07 -0.0017 0.0061 0.0010 -0.0009
4 2005-01-11 -0.0065 -0.0040 -0.0147 0.0070
3670 2020-09-16 -0.0046 -0.0065 -0.0003 -0.0009
3671 2020-09-17 -0.0083 -0.0034 -0.0039 -0.0086
3672 2020-09-18 -0.0024 -0.0009 -0.0009 0.0052
3673 2020-09-23 -0.0206 0.0102 0.0022 -0.0013
3674 2020-09-24 0.0021 -0.0136 -0.0073 -0.0116
From df I created 4 new dataframes called Eminis, FTSE, Stoxx and Nikkei.
Thanks for your help!!!!
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('classic')
df = pd.read_excel('data.xlsx')
df = df.rename(columns={'Dates':'Date','ES1': 'Eminis', 'Z 1': 'FTSE','VG1': 'Stoxx','NK1': 'Nikkei','TY1': 'Notes','G 1': 'Gilts', 'RX1': 'Bunds','JB1': 'JGBS','CL1': 'Oil','HG1': 'Copper','S 1': 'Soybeans','GC1': 'Gold','WILLTIPS': 'TIPS'})
headers = df.columns
Eminis = df[['Date','Eminis']]
FTSE = df[['Date','FTSE']]
Stoxx = df[['Date','Stoxx']]
Nikkei = df[['Date','Nikkei']]
# create multiple plots via plt.subplots(rows,columns)
fig, axes = plt.subplots(2,2, figsize=(20,15))
x = Date
y1 = Eminis
y2 = Notes
y3 = Stoxx
y4 = Nikkei
# one plot on each subplot
axes[0][0].line(x,y1)
axes[0][1].line(x,y2)
axes[1][0].line(x,y3)
axes[1][1].line(x,y4)
plt.legends()
plt.show()
An elegant solution is to:
1. Set the Dates column in your DataFrame as the index.
2. Create a figure with the required number of subplots (in your case 4) by calling plt.subplots.
3. Draw a plot from your DataFrame, passing:
   - ax: the result from subplots (here it is an array of Axes objects, not a single Axes),
   - subplots=True: to draw each column in a separate subplot.
The code to do it is:
fig, a = plt.subplots(2, 2, figsize=(12, 6), tight_layout=True)
df.plot(ax=a, subplots=True, rot=60);
To test the above code I created the following DataFrame:
np.random.seed(1)
ind = pd.date_range('2005-01-01', '2006-12-31', freq='7D')
df = pd.DataFrame(np.random.rand(ind.size, 4),
index=ind, columns=['ES1', 'Z 1', 'VG1', 'NK1'])
and got the following picture:
As my test data are random, I assumed a "7 days" frequency to keep the
picture from getting too cluttered.
For your real data, consider resampling, e.g. also with a '7D' frequency
and the mean() aggregation function.
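The resampling suggestion can be sketched as follows. The daily values are random numbers made up for this example (they are not the asker's data), and the frame is assumed to already have a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# made-up daily returns for two of the columns, four weeks long
rng = np.random.default_rng(1)
dates = pd.date_range("2005-01-03", periods=28, freq="D")
daily = pd.DataFrame(
    rng.normal(0, 0.01, size=(len(dates), 2)),
    index=dates,
    columns=["Eminis", "FTSE"],
)

# group the rows into consecutive 7-day windows and average each window
weekly = daily.resample("7D").mean()
print(weekly)
```

The weekly frame has one row per 7-day window, so passing it to df.plot(subplots=True) draws far fewer points per line than the daily data.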
I think the more succinct option is not to make many dataframes, which creates unnecessary work and complexity.
Plotting data is about shaping the dataframe for the plot API.
In this case, a better option is to convert the dataframe to a long (tidy) format, from a wide format, using .stack.
This places all the labels in one column, and the values in another column
Use seaborn.relplot, which can create a FacetGrid from a dataframe in a long format.
seaborn is a high-level API for matplotlib, and makes plotting much easier.
If the dataframe contains many stocks, but only a few are to be plotted, they can be selected with Boolean indexing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# import data from excel, or setup test dataframe
data = {'Dates': ['2005-01-04', '2005-01-05', '2005-01-06', '2005-01-07', '2005-01-11', '2020-09-16', '2020-09-17', '2020-09-18', '2020-09-23', '2020-09-24'],
'ES1': [-0.0126, -0.0065, 0.0042, -0.0017, -0.0065, -0.0046, -0.0083, -0.0024, -0.0206, 0.0021],
'Z 1': [0.0077, -0.0057, 0.0017, 0.0061, -0.004, -0.0065, -0.0034, -0.0009, 0.0102, -0.0136],
'VG1': [-0.003, 0.0007, 0.0051, 0.001, -0.0147, -0.0003, -0.0039, -0.0009, 0.0022, -0.0073],
'NK1': [0.0052, -0.0095, 0.0044, -0.0009, 0.007, -0.0009, -0.0086, 0.0052, -0.0013, -0.0116]}
df = pd.DataFrame(data)
# rename columns
df = df.rename(columns={'Dates':'Date','ES1': 'Eminis', 'Z 1': 'FTSE','VG1': 'Stoxx','NK1': 'Nikkei'})
# set Date to a datetime
df.Date = pd.to_datetime(df.Date)
# set Date as the index
df.set_index('Date', inplace=True)
# stack the dataframe
dfs = df.stack().reset_index().rename(columns={'level_1': 'Stock', 0: 'val'})
# to select only a subset of values from Stock, to plot, select them with Boolean indexing
df_select = dfs[dfs.Stock.isin(['Eminis', 'FTSE', 'Stoxx', 'Nikkei'])]
# df_select.head()
Date Stock val
0 2005-01-04 Eminis -0.0126
1 2005-01-04 FTSE 0.0077
2 2005-01-04 Stoxx -0.0030
3 2005-01-04 Nikkei 0.0052
4 2005-01-05 Eminis -0.0065
# plot
sns.relplot(data=df_select, x='Date', y='val', col='Stock', col_wrap=2, kind='line')
What am I doing wrong here?
The current implementation is inefficient, has a number of incorrect method calls, and undefined variables.
Date is not defined for x = Date
y2 = Notes: Notes is not defined
.line is not an Axes method and causes an AttributeError; it should be .plot
y1 - y4 are DataFrames, but are passed to the plot method for the y-axis, which causes TypeError: unhashable type: 'numpy.ndarray'; a single column should be passed as y.
.legends is not a method; it's .legend
The legend must be shown for each subplot, if one is desired.
Eminis = df[['Date','Eminis']]
FTSE = df[['Date','FTSE']]
Stoxx = df[['Date','Stoxx']]
Nikkei = df[['Date','Nikkei']]
# create multiple plots via plt.subplots(rows,columns)
fig, axes = plt.subplots(2,2, figsize=(20,15))
x = df.Date
y1 = Eminis.Eminis
y2 = FTSE.FTSE
y3 = Stoxx.Stoxx
y4 = Nikkei.Nikkei
# one plot on each subplot
axes[0][0].plot(x,y1, label='Eminis')
axes[0][0].legend()
axes[0][1].plot(x,y2, label='FTSE')
axes[0][1].legend()
axes[1][0].plot(x,y3, label='Stoxx')
axes[1][0].legend()
axes[1][1].plot(x,y4, label='Nikkei')
axes[1][1].legend()
plt.show()
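The four near-identical plot/legend pairs above can also be written as a loop over the axes. This is only a sketch: plot_columns_grid is a hypothetical helper name, and the frame below holds made-up random values standing in for the spreadsheet data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_columns_grid(df, date_col, value_cols, nrows=2, ncols=2):
    """Draw one line plot per column on a grid of subplots (hypothetical helper)."""
    # squeeze=False guarantees a 2D array of Axes, even for a 1x1 grid
    fig, axes = plt.subplots(nrows, ncols, figsize=(12, 8), squeeze=False)
    # axes.flat walks the grid row by row; zip stops at the shorter sequence
    for ax, col in zip(axes.flat, value_cols):
        ax.plot(df[date_col], df[col], label=col)
        ax.legend()
    fig.autofmt_xdate()  # rotate the date tick labels
    return fig, axes


# made-up stand-in for the renamed dataframe from the question
rng = np.random.default_rng(0)
names = ["Eminis", "FTSE", "Stoxx", "Nikkei"]
df = pd.DataFrame(rng.normal(0, 0.01, size=(10, 4)), columns=names)
df.insert(0, "Date", pd.date_range("2005-01-04", periods=10, freq="D"))

fig, axes = plot_columns_grid(df, "Date", names)
```

Call plt.show() afterwards to display the figure.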

Fill NaN values in a column within a specific range of values

I am wanting to do the following:
Fill NaN values in a single column using values within a specific range.
The range I am wanting to use is the mean of the non-NaN values in the column +/- one standard
deviation of the computed mean.
NOTE If possible, I would like to be able to use multiples of the std dev by simply multiplying it by
a constant.
I thought I had it (see full code below) but the output from print(df['C'].describe()) shows that
I am generating values well outside my desired range. In fact, I am generating numbers outside
the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m') # Bold red
display_settings = {
'max_columns': 15,
'max_colwidth': 60,
'expand_frame_repr': False, # Wrap to multiple pages
'max_rows': 50,
'precision': 6,
'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
pd.set_option("display.{}".format(op), value)
df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')
# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65) # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75) # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')
def fillNaN_with_unifrand(col):
a = col.values
m = np.isnan(a) # mask of NaNs
mu, sigma = col.mean(), col.std()
a[m] = np.random.normal(mu, sigma, size=m.sum())
return col
# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics works. A Gaussian (normal) distribution has a mean and a std, but values can be drawn far away from mean +- std; they are just less likely. For a normal distribution, about 68 % of all values lie within +- 1*std, about 95 % within +- 2*std, and so on. The question is: what do you want to do with outliers? Set them to mean +- std, or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
import numpy as np
from matplotlib import pyplot as plt
mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000) # I used a size of 2000 as an example
a[a<(mu-sigma)] = mu-sigma
a[a>(mu+sigma)] = mu+sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution. It creates a distribution with a lower and an upper boundary. You find this function in the scipy.stats module. It works a bit differently, though: you first create the distribution by normalizing the lower and upper clip, and then you draw a number of random variates (rvs) from it, like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
The constant of multiples of sigma is easily implemented. You can just change your lower and upper clip like
lower_clip = mu-x*sigma
with x being your constant.
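Applied to the original problem of filling NaNs in a column, the truncated normal approach might look like this sketch. fill_nan_truncnorm, k and seed are names introduced here for illustration; k is the constant multiple of the standard deviation, and the test column is made-up random data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def fill_nan_truncnorm(col, k=1.0, seed=None):
    """Return a copy of col with NaNs replaced by draws from a normal
    distribution truncated to mean +/- k * std (hypothetical helper)."""
    mu, sigma = col.mean(), col.std()  # both skip NaN by default
    mask = col.isna()
    # truncnorm expects its bounds in units of scale around loc,
    # so +/- k here means mu +/- k * sigma
    draws = stats.truncnorm.rvs(
        -k, k, loc=mu, scale=sigma, size=int(mask.sum()), random_state=seed
    )
    filled = col.copy()
    filled.loc[mask] = draws
    return filled


# made-up column with roughly 35 % of the values set to NaN
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 1000, size=200)).astype(float)
col.loc[col.sample(frac=0.35, random_state=0).index] = np.nan

filled = fill_nan_truncnorm(col, k=1.0, seed=0)
print(filled.describe())
```

With k=1.0 every filled value lies between mean - std and mean + std of the non-NaN values; passing k=2.0 widens the range to two standard deviations.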

How to plot multi-index, categorical data?

Given the following data:
DC,Mode,Mod,Ven,TY1,TY2,TY3,TY4,TY5,TY6,TY7,TY8
Intra,S,Dir,C1,False,False,False,False,False,True,True,False
Intra,S,Co,C1,False,False,False,False,False,False,False,False
Intra,M,Dir,C1,False,False,False,False,False,False,True,False
Inter,S,Co,C1,False,False,False,False,False,False,False,False
Intra,S,Dir,C2,False,True,True,True,True,True,True,False
Intra,S,Co,C2,False,False,False,False,False,False,False,False
Intra,M,Dir,C2,False,False,False,False,False,False,False,False
Inter,S,Co,C2,False,False,False,False,False,False,False,False
Intra,S,Dir,C3,False,False,False,False,True,True,False,False
Intra,S,Co,C3,False,False,False,False,False,False,False,False
Intra,M,Dir,C3,False,False,False,False,False,False,False,False
Inter,S,Co,C3,False,False,False,False,False,False,False,False
Intra,S,Dir,C4,False,False,False,False,False,True,False,True
Intra,S,Co,C4,True,True,True,True,False,True,False,True
Intra,M,Dir,C4,False,False,False,False,False,True,False,True
Inter,S,Co,C4,True,True,True,False,False,True,False,True
Intra,S,Dir,C5,True,True,False,False,False,False,False,False
Intra,S,Co,C5,False,False,False,False,False,False,False,False
Intra,M,Dir,C5,True,True,False,False,False,False,False,False
Inter,S,Co,C5,False,False,False,False,False,False,False,False
Imports:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
To reproduce my DataFrame, copy the data then use:
df = pd.read_clipboard(sep=',')
I'd like to create a plot conveying the same information as my example, but not necessarily with the same shape (I'm open to suggestions). I'd also like to hover over the color and have the appropriate Ven displayed (e.g. C1, not 1):
Edit 2018-10-17:
The two solutions provided so far, are helpful and each accomplish a different aspect of what I'm looking for. However, the key issue I'd like to resolve, which wasn't explicitly stated prior to this edit, is the following:
I would like to perform the plotting without converting Ven to an int; this numeric transformation isn't practical with the real data. So the actual scope of the question is to plot all categorical data with two categorical axes.
The issue I'm experiencing is the data is categorical and the y-axis is multi-indexed.
I've done the following to transform the DataFrame:
# replace False with nan
df = df.replace(False, np.nan)
# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
return row.iloc[4:].replace(True, int(row.Ven[1]))
df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
Plotting the transformed DataFrame produces:
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
This plot isn't very streamlined: there are four axis values for each Ven. This is only a subset of the data, so the graph would be very long with all of it.
Here's my solution. Instead of plotting I just apply a style to the DataFrame, see https://pandas.pydata.org/pandas-docs/stable/style.html
# Transform Ven values from "C1", "C2" to 1, 2, ..
df['Ven'] = df['Ven'].str[1]
# Given a specific combination of dc, mode, mod, ven,
# do we have any True cells?
g = df.groupby(['DC', 'Mode', 'Mod', 'Ven']).any()
# Let's drop any rows with only False values
g = g[g.any(axis=1)]
# Convert True, False to 1, 0
g = g.astype(int)
# Get the values of the ven index as an int array
# Note: we don't want to drop the ven index!!
# Otherwise styling won't work
ven = g.index.get_level_values('Ven').values.astype(int)
# Multiply 1 and 0 with Ven value
g = g.mul(ven, axis=0)
# Sort the index
g.sort_index(ascending=False, inplace=True)
# Now display the dataframe with styling
# first we get a color map
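import matplotlib
# matplotlib.cm.get_cmap was removed in Matplotlib 3.9;
# the matplotlib.colormaps registry (3.5+) replaces it
cmap = matplotlib.colormaps['tab10']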
def apply_color_map(val):
# hide the 0 values
if val == 0:
return 'color: white; background-color: white'
else:
# for non-zero: get color from cmap, convert to hexcode for css
s = "color:white; background-color: " + matplotlib.colors.rgb2hex(cmap(val))
return s
g
g.style.applymap(apply_color_map)
The available matplotlib colormaps can be seen here: Colormap reference, with some additional explanation here: Choosing a colormap
Explanation: Remove rows where TY1-TY8 are all nan to create your plot. Refer to this answer as a starting point for creating interactive annotations to display Ven.
The below code should work:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_clipboard(sep=',')
# replace False with nan
df = df.replace(False, np.nan)
# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
return row.iloc[4:].replace(True, int(row.Ven[1]))
df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
idx = df[['TY1','TY2', 'TY3', 'TY4','TY5','TY6','TY7','TY8']].dropna(thresh=1).index.values
df = df.loc[idx,:].sort_values(by=['DC', 'Mode','Mod'], ascending=False)
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
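For the edited requirement (no integer conversion of Ven), one possible sketch is to reshape the data to long form and let Matplotlib's categorical axes handle the string labels directly. This is not taken from either answer: the four rows are copied from the question's data, while the row and flag column names are introduced here for illustration, and hovering is not covered.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# four rows copied from the data in the question
csv = """DC,Mode,Mod,Ven,TY1,TY2,TY3,TY4,TY5,TY6,TY7,TY8
Intra,S,Dir,C1,False,False,False,False,False,True,True,False
Intra,M,Dir,C1,False,False,False,False,False,False,True,False
Intra,S,Dir,C2,False,True,True,True,True,True,True,False
Inter,S,Co,C4,True,True,True,False,False,True,False,True
"""
df = pd.read_csv(io.StringIO(csv))

# wide -> long: one row per (DC, Mode, Mod, Ven, TY) combination
long = df.melt(id_vars=["DC", "Mode", "Mod", "Ven"], var_name="TY", value_name="flag")

# keep only the True cells and build a single string label for the y-axis
long = long[long["flag"]].assign(
    row=lambda d: d["DC"] + "/" + d["Mode"] + "/" + d["Mod"]
)

fig, ax = plt.subplots(figsize=(8, 3))
# one scatter call per Ven: strings go straight onto both categorical axes
# and the Ven label is used as the legend entry, with no integer conversion
for ven, grp in long.groupby("Ven"):
    ax.scatter(grp["TY"], grp["row"], marker="s", s=200, label=ven)
ax.legend(title="Ven")
```

Markers from different Ven values that share the same cell are drawn on top of each other in this sketch, so real data may need an offset or a smaller marker per Ven.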

day of the week as X in Seaborn plot

I have a dataset with clicks and impressions, I aggregated them by the day of week using groupby and agg
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
Then I was trying to plot them out using subplot
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
but instead using the day of week as the X, the plot used values in clicks and impressions instead. Is there a way to force the X to day of the week while value is in Y instead? Thanks.
Full code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
df=pd.read_csv('data/data_clean.csv')
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
plt.show()
Fake Data:
day_of_week,Clicks,Impressions
0 100 2000
1 400 4000
2 300 3500
3 200 2000
4 100 1000
5 50 500
6 10 150
I was able to find the answer with seaborn with Peter's guidance.
The correct plotting code is
sns.barplot( x=df2['day_of_week'],y=df2['Clicks'] , color="skyblue", ax=axes[0, 0])
sns.barplot( x=df2['day_of_week'],y=df2['Impressions'] , color="olive", ax=axes[0, 1])
It seems seaborn by default would take the first variable as X instead of Y.
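The difference between the two seaborn functions can be sketched in plain pandas: countplot draws the number of rows per category, while barplot draws an aggregate of a y column per category. The raw rows below are made up for illustration; only their per-day totals match the first two rows of the fake data.

```python
import pandas as pd

# made-up, un-aggregated rows: several records per day of the week
raw = pd.DataFrame({
    "day_of_week": [0, 0, 1, 1, 1],
    "Clicks": [60, 40, 100, 200, 100],
})

# what countplot(x="day_of_week") shows: how many rows each day has
rows_per_day = raw["day_of_week"].value_counts().sort_index()

# what a bar plot of the aggregated frame shows: total clicks per day
clicks_per_day = raw.groupby("day_of_week")["Clicks"].sum()

print(rows_per_day)
print(clicks_per_day)
```

Since df2 already has exactly one row per day, counting rows carries no information there, which is why a bar plot of the summed values is the better fit.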
Based on the seaborn docs, I think countplot expects a long-form dataframe such as your df, not the pre-aggregated df2 that you built and pass in your question. countplot does the counting for you.
However, your df2 is ready for a pandas bar plot:
df2.plot(kind='bar', y=['Impressions', 'Clicks'])
Result:
