How to group-by twice, preserve original columns, and plot

I have the following data sets (only sample is shown):
I want to find the most impactful exercise per area and then plot it via Seaborn barplot.
I use the following code to do so.
# Create Dataset Using Only Area, Exercise and Impact Level Categories
CA_data = Data[['area', 'exercise', 'impact level']]
# Compute Mean Impact Level per Exercise per Area
mean_il_CA = CA_data.groupby(['area', 'exercise'])['impact level'].mean().reset_index()
mean_il_CA_hello = mean_il_CA.groupby('area')['impact level'].max().reset_index()
# Plot
cx = sns.barplot(x="impact level", y="area", data=mean_il_CA_hello)
plt.title('Most Impactful Exercises Considering Area')
plt.show()
The resulting dataset is:
This means that when I plot, only the area label appears on the y axis, NOT 'area label' + 'exercise label' as I would like.
How do I reinsert the 'exercise' column into my final dataset?
How do I get both the name of the area and the exercise on the y plot?

The problem of losing the values of 'exercise' when grouping by the maximum of 'area' can be solved by keeping the MultiIndex (i.e. not using reset_index) and using .transform to create a boolean mask to select the appropriate full rows of mean_il_CA that contain the maximum 'impact_level' values per 'area'. This solution is based on the code provided in this answer by unutbu. The full labels for the bar chart can be created by concatenating the labels of 'area' and 'exercise'.
Here is an example using the titanic dataset from the seaborn package. The variables 'class', 'embark_town', and 'fare' are used in place of 'area', 'exercise', and 'impact_level'. The categorical variables both contain three unique values: 'First', 'Second', 'Third', and 'Cherbourg', 'Queenstown', 'Southampton'.
import pandas as pd # v 1.2.5
import seaborn as sns # v 0.11.1
df = sns.load_dataset('titanic')
data = df[['class', 'embark_town', 'fare']]
data.head()
data_mean = data.groupby(['class', 'embark_town'])['fare'].mean()
data_mean
# Select max values in each class and create concatenated labels
mask_max = data_mean.groupby(level=0).transform(lambda x: x == x.max())
data_mean_max = data_mean[mask_max].reset_index()
data_mean_max['class, embark_town'] = data_mean_max['class'].astype(str) + ', ' \
+ data_mean_max['embark_town']
data_mean_max
# Draw seaborn bar chart
sns.barplot(data=data_mean_max,
            x='fare',
            y='class, embark_town')
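The same steps can be sketched with the column names from the question. The small `Data` frame below is made up for illustration (the original data is not shown), and the names `mean_il`, `is_max`, and `best` are hypothetical. The built-in "max" transform is used here in place of the lambda, which is an equivalent way to build the mask.

```python
import pandas as pd

# Hypothetical sample standing in for the question's `Data` frame
Data = pd.DataFrame({
    "area": ["legs", "legs", "legs", "arms", "arms", "arms"],
    "exercise": ["squat", "lunge", "squat", "curl", "dip", "dip"],
    "impact level": [8, 5, 6, 4, 7, 9],
})

# Mean impact level per (area, exercise); keep the MultiIndex for now
mean_il = Data.groupby(["area", "exercise"])["impact level"].mean()

# Boolean mask: True where a row holds the maximum mean of its area
is_max = mean_il == mean_il.groupby(level="area").transform("max")

# Keep only the winning rows, then bring 'area' and 'exercise' back as columns
best = mean_il[is_max].reset_index()

# Combined label for the y axis of the bar chart
best["area, exercise"] = best["area"] + ", " + best["exercise"]
print(best)
```

The resulting frame can then be passed to seaborn, for example with sns.barplot(data=best, x="impact level", y="area, exercise").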

Related

Pandas Crosstab Plot Top N Elements

I have created a Pandas Crosstab with some data I am currently working with.
contingencyTable = pd.crosstab(index=dfBoolean['ColA'],
columns=dfBoolean['Category'],
margins=True)
Some of the categories from both the primaryDistribution and the categoricalColumnName have a TON of categories--way too many to plot nicely on a single graph. Is it possible to select the top n (say, 5, for example) categories to plot in the cross tab? Thanks!
plt.rcParams.update({'font.size': 12})
bPlot = contingencyTable.plot.bar(rot=0, figsize=(9.2, 5.7))
Here is some sample data:
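One possible approach, sketched on a small made-up frame standing in for dfBoolean, is to rank the categories by frequency and keep only the n most frequent columns of the crosstab. The names `df`, `n`, `top_categories`, and `top_table` are hypothetical, and margins=True is left out so that the 'All' row and column do not take part in the selection.

```python
import pandas as pd

# Made-up data standing in for dfBoolean (assumption: one row per observation)
df = pd.DataFrame({
    "ColA": ["a", "a", "b", "b", "c", "c", "a", "b"],
    "Category": ["x", "y", "x", "z", "x", "y", "x", "w"],
})

n = 2  # number of categories to keep

# The n most frequent categories, most frequent first
top_categories = df["Category"].value_counts().nlargest(n).index

# Full contingency table, then restricted to the top-n columns
table = pd.crosstab(index=df["ColA"], columns=df["Category"])
top_table = table[top_categories]
print(top_table)
```

The reduced table can be plotted the same way as the full one, for example with top_table.plot.bar(rot=0).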

Pandas Series boolean maps and plotting

I am just trying to up my understanding of plotting Pandas Series data using Booleans to mask out values I don't want. I am not sure that what I have is the correct or efficient way to do it.
Don't get me wrong, I do get the chart I am after but are my assumptions on the syntax correct?
All I want to do is plot the non zero values on my chart. I have not formatted the charts as I would normally as this was just a test of Booleans and masking data and not for creating report grade charts.
If I masked this as a Pandas DataFrame I would do the following if df1 were my DataFrame.
I understand this and it makes sense that the df1[mask] returns my values as required
# Plot our graph with only items that are non-zero
fig = px.bar(df1[mask], x = 'Animals', y = 'Count')
fig.show()
Doing it as a Pandas Series
This is the snippet that creates the graph I require
# Plot our graph with only items that are non-zero
fig = px.bar(sf, x = sf.index[sf_mask], y = sf[sf_mask])
fig.show()
After my initial test of adding my mask to sf gave an error, I deduced that I needed to apply the mask to both the x and y parameters. I take it this is because a Series is just a single column with my "animals" set as its index. Therefore sf.index[sf_mask] returns the animals from the index and sf[sf_mask] returns the values. Failing to mask either one gives a "ValueError" stating that the arguments should have the same length.
Here is what I did to test my workings
My initial imports and setting up Plotly as my plotting backend
import pandas as pd
import plotly.express as px
# Set our plotting backend to Plotly
pd.options.plotting.backend = "plotly"
I just created a test dataset from a dictionary
animals = {'rabbits' : 1,
'dogs' : 3,
'cats' : 0,
'ferrets' : 3,
'horses' : 8,
'goldfish' : 0,
'guinea_pigs' : 2,
'hamsters' : 6,
'mice' : 3,
'rats' : 0
}
Then converted it to a pandas Series
sf = pd.Series(animals)
I then create my boolean mask to mask out all the zero entries in our Pandas Series
sf_mask = sf != 0
And if I then apply the mask I can see I only get non-zero values, which is exactly what I am looking for.
sf[sf_mask]
Which outputs my non-zero items in my series.
rabbits 1
dogs 3
ferrets 3
horses 8
guinea_pigs 2
hamsters 6
mice 3
dtype: int64
If I plot without my Boolean mask 'sf_mask' using the following syntax I get my complete Pandas Series charted
# Plot our Series showing all items
fig = px.bar(sf, x = sf.index, y = sf)
fig.show()
Which outputs the following chart
If I plot with my Boolean mask 'sf_mask' using the following syntax I get the chart I want which excludes the gaps with zero value items.
# Plot our graph with only items that are non-zero
fig = px.bar(sf, x = sf.index[sf_mask], y = sf[sf_mask])
fig.show()
Which outputs the correct chart.

Your understanding of booleans and masking is correct.
You can simplify your syntax a little though: if you take a look at the plotly.express.bar documentation, you'll see that the arguments 'x' and 'y' are optional. You don't need to pass 'x' or 'y' because by default plotly.express will create the bars using the index of the Series as x and the values of the Series as y. You can also pass the masked series in place of the entire series.
For example, this will produce the same bar chart:
fig = px.bar(sf[sf>0])
fig.update_layout(showlegend=False)
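A minimal pandas-only sketch of why the masked Series can be passed on its own: boolean indexing filters the index and the values together, so they always stay the same length. It uses a shortened version of the dictionary from the question, and the name `nonzero` is hypothetical. The plotting call is left as a comment so the snippet does not depend on plotly.

```python
import pandas as pd

# Shortened version of the test data from the question
sf = pd.Series({"rabbits": 1, "dogs": 3, "cats": 0,
                "ferrets": 3, "horses": 8, "goldfish": 0})

sf_mask = sf != 0      # True for every non-zero entry
nonzero = sf[sf_mask]  # filters index and values in one step

# Masking only one of x / y leaves arrays of different lengths,
# which is what triggers the "same length" ValueError
print(len(sf.index), len(sf[sf_mask]))              # lengths differ
print(len(sf.index[sf_mask]), len(sf[sf_mask]))     # lengths match

# fig = px.bar(nonzero)  # would plot the index as x and the values as y
```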

How to change scatter plot marker color in plotting loop using pandas?

I'm trying to write a simple program that reads in a CSV with various datasets (all of the same length) and automatically plots them all (as a Pandas Dataframe scatter plot) on the same figure. My current code does this well, but all the marker colors are the same (blue). I'd like to figure out how to make a colormap so that in the future, if I have much larger data sets (let's say, 100+ different X-Y pairings), it will automatically color each series as it plots. Eventually, I would like for this to be a quick and easy method to run from the command line. I did not have luck reading the documentation or stack exchange, hopefully this is not a duplicate!
I've tried the recommendations from these posts:
1) Setting different color for each series in scatter plot on matplotlib
2) https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
3) https://matplotlib.org/users/colormaps.html
However, the first one essentially grouped the data points according to their position on the x-axis and made those groups of data the same color (not what I want, each series of data is roughly a linearly increasing function). The second and third links seemed to have worked, but I don't like the colormap choices (e.g. "viridis", many colors are too similar and it's hard to distinguish data points).
This is a simplified version of my code so far (took out other lines that automatically named axes, etc. to make it easier to read). I've also removed any attempts I've made to specify a colormap, for more of a blank canvas feel:
''' Importing multiple scatter data and plotting '''
import pandas as pd
import matplotlib.pyplot as plt
### Data file path (please enter Dataframe however you like)
path = r'/Users/.../test_data.csv'
### Read in data CSV
data = pd.read_csv(path)
### List of headers
header_list = list(data)
### Set data type to float so modified data frame can be plotted
data = data.astype(float)
### X-axis limits
xmin = 1e-4
xmax = 3e-3
## Create subplots to be plotted together after loop
fig, ax = plt.subplots()
### Since there are multiple X-axes (every other column), this loop only plots every other x-y column pair
for i in range(len(header_list)):
    if i % 2 == 0:
        dfplot = data.plot.scatter(x = "{}".format(header_list[i]), y = "{}".format(header_list[i + 1]), ax=ax)
        dfplot.set_xlim(xmin,xmax) # Setting limits on X axis
plt.show()
The dataset can be found in the google drive link below. Thanks for your help!
https://drive.google.com/drive/folders/1DSEs8D7lIDUW4NIPBl2qW2EZiZxslGyM?usp=sharing
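One possible way to get a distinct color per series is to draw the colors from a qualitative colormap such as matplotlib's "tab20" and index it by the pair number. The sketch below is an assumption-laden illustration, not the poster's code: the data is randomly generated in place of the CSV, ax.scatter is called directly instead of data.plot.scatter, and the names `n_pairs` and `colors` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the CSV: columns alternate x0, y0, x1, y1, x2, y2
rng = np.random.default_rng(0)
data = pd.DataFrame({f"{axis}{k}": rng.random(5)
                     for k in range(3) for axis in ("x", "y")})

header_list = list(data)
n_pairs = len(header_list) // 2

# Qualitative colormap with 20 distinct colors; wrap around if there are more pairs
cmap = plt.get_cmap("tab20")
colors = [cmap(k % cmap.N) for k in range(n_pairs)]

fig, ax = plt.subplots()
for k in range(n_pairs):
    x_col, y_col = header_list[2 * k], header_list[2 * k + 1]
    ax.scatter(data[x_col], data[y_col], color=colors[k], label=y_col)
ax.legend()
```

With more than 20 pairs the colors repeat; a continuous colormap sampled at n_pairs evenly spaced points would avoid repeats at the cost of neighboring colors looking similar.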

How to label line chart with column from pandas dataframe (from 3rd column values)?

I have a data set I filtered to the following (sample data):
Name Time l
1 1.129 1G-d
1 0.113 1G-a
1 3.374 1B-b
1 3.367 1B-c
1 3.374 1B-d
2 3.355 1B-e
2 3.361 1B-a
3 1.129 1G-a
I got this data after filtering the data frame and converting it to CSV file:
# Assigns the new data frame to "df" with the data from only three columns
header = ['Names','Time','l']
df = pd.DataFrame(df_2, columns = header)
# Sorts the data frame by column "Names" as integers
df.Names = df.Names.astype(int)
df = df.sort_values(by=['Names'])
# Changes the data to match format after converting it to int
df.Time=df.Time.astype(int)
df.Time = df.Time/1000
csv_file = df.to_csv(index=False, columns=header, sep=" " )
Now, I am trying to graph lines for each label column data/items with markers.
I want the column l as my line names (labels) - each as a new line, Time as my Y-axis values and Names as my X-axis values.
So, in this case, I would have 7 different lines in the graph with these labels: 1G-d, 1G-a, 1B-b, 1B-c, 1B-d, 1B-e, 1B-a.
I have done the following so far which is the additional settings, but I am not sure how to graph the lines.
plt.xlim(0, 60)
plt.ylim(0, 18)
plt.legend(loc='best')
plt.show()
I used sns.lineplot which comes with hue and I do not want to have name for the label box. Also, in that case, I cannot have the markers without adding new column for style.
I also tried plt.plot but in that case, I am not sure how to have more lines. I can only give x and y values which create only one line.
If there's any other source, please let me know below.
Thanks
The final graph I want to have is like the following but with markers:

You can apply a few tweaks to seaborn's lineplot. Using some created data since your sample isn't really long enough to demonstrate:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Create data
np.random.seed(2019)
categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
df = pd.DataFrame({'Name':np.repeat(range(1,11), 10),
'Time':np.random.randn(100).cumsum(),
'l':np.random.choice(categories, 100)
})
# Plot
sns.lineplot(data=df, x='Name', y='Time', hue='l', style='l', dashes=False,
markers=True, ci=None, err_style=None)
# Temporarily removing limits based on sample data
#plt.xlim(0, 60)
#plt.ylim(0, 18)
# Remove seaborn legend title & set new title (if desired)
ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[1:], labels=labels[1:], title='New Title', loc='best')
plt.show()
To apply markers, you have to specify a style variable. This can be the same as hue.
You will likely want to turn off dashes, ci, and err_style, as done with dashes=False, ci=None, and err_style=None.
To remove the seaborn legend title, you can get the handles and labels, then re-add the legend without the first handle and label; this assumes a seaborn version that lists the title as the first legend entry. You can also specify the location here and set a new title if desired (or just remove title=... for no title).
Edits per comments:
Filtering your data to only a subset of level categories can be done fairly easily via:
categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
df = df.loc[df['l'].isin(categories)]
markers=True will fail if there are too many levels. If you are only interested in marking points for aesthetic purposes, you can simply multiply a single marker by the number of categories you are interested in (which you have already created to filter your data to categories of interest): markers='o'*len(categories).
Alternatively, you can specify a custom dictionary to pass to the markers argument:
points = ['o', '*', 'v', '^']
mult = len(categories) // len(points) + (len(categories) % len(points) > 0)
markers = {key:value for (key, value)
in zip(categories, points * mult)}
This will return a dictionary of category-point combinations, cycling over the marker points specified until each item in categories has a point style.
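As a sketch of what that dictionary looks like, the same cycling can also be written with itertools.cycle from the standard library, which is an alternative to multiplying the list and gives the same mapping. The category and marker lists are the ones used above.

```python
from itertools import cycle

categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
points = ['o', '*', 'v', '^']

# zip stops at the shorter input, so cycle(points) is consumed
# only as far as there are categories
markers = dict(zip(categories, cycle(points)))
print(markers)
```

The fifth category wraps around to the first marker again, so '1G-d' and '1B-d' share 'o'.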

How to divide the area between two co-ordinates into blocks and assign some values to those blocks?

Basically I have to create a heatmap of the crowd present in an area.
I have two coordinates. X starts from 0 and its maximum is 119994. Y ranges from -14,000 to +27,000. I have to divide these coordinates into as many blocks as I wish, count the number of people in each block, and create a heatmap of the whole area.
Basically show the crowdedness of the area divided as blocks.
I have data in the below format:-
Employee_ID X_coord Y_coord_start Y_coord_end
23 1333 0 6000
45 3999 7000 17000
I tried dividing both the coordinate maximums by 100 (to make 100 blocks) and tried finding the block coordinates, but that was very complex.
As I have to make a heatmap I have to prepare a matrix of values in the form of blocks. Every block will have a count of people which I can count and find out from my data but the problem is how to make these blocks of coordinates?
I have another question regarding scatter plot:-
My data is:-
Batch_ID Pieces_Productivity
181031008780 4.578886
181031008781 2.578886
When I plot it using the following code:-
plt.scatter(list(df_books_location.Batch_ID),list(df_books_location['Pieces_productivity']), s=area, alpha=0.5)
It doesn't give me a proper plot. But when I plot with small integers (0-1000) for Batch_ID I get a good graph. How do I handle large integers for plotting?

I don't know which of the two Y_coord columns should give the actual Y coordinate, nor whether your plot should evaluate the data on a strict "grid" or rather smooth it out; hence I am using both an imshow() and a sns.kdeplot() in the code below:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
### generate some data
np.random.seed(0)
data = np.random.multivariate_normal([0, 0], [(1, .6), (.6, 1)], 100)
## this would e.g. be X,Y=df['X_coord'], df['Y_coord_start'] :
X,Y=data[:,0],data[:,1]
fig,ax=plt.subplots(nrows=1,ncols=3,figsize=(10,5))
ax[0].scatter(X,Y)
sns.kdeplot(x=X, y=Y, fill=True, ax=ax[1], cmap="viridis")  # keyword form for seaborn >= 0.11; older versions used kdeplot(X, Y, shade=True)
## the X,Y points are binned into 10x10 bins here, you will need
# to adjust the amount of bins so that it looks "nice" for you
heatmap, xedges, yedges = np.histogram2d(X, Y, bins=(10,10))
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
im=ax[2].imshow(heatmap.T, extent=extent,
origin="lower",aspect="auto",
interpolation="nearest") ## also play with different interpolations
## Loop over heatmap dimensions and create text annotations:
# note that we need to "push" the text from the lower left corner of each pixel
# into the center of each pixel
## also try to choose a text color which is readable on all pixels,
# or e.g. use vmin=… vmax= to adjust the colormap such that the colors
# don't clash with e.g. white text
pixel_center_x=(xedges[1]-xedges[0])/2.
pixel_center_y=(yedges[1]-yedges[0])/2.
for i in range(np.shape(heatmap)[1]):
    for j in range(np.shape(heatmap)[0]):
        text = ax[2].text(pixel_center_x+xedges[j], pixel_center_y+yedges[i],'{0:0.0f}'.format(heatmap[j, i]),
                          ha="center", va="center", color="w",fontsize=6)
plt.colorbar(im)
plt.show()
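yields a figure with three panels: the scatter plot, the kernel density estimate, and the annotated 10x10 heatmap.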
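To apply the same binning to the coordinate ranges stated in the question, np.histogram2d accepts an explicit range, so the blocks cover the whole area even where nobody is standing. This is a minimal sketch: the four positions are made up, Y_coord_start is assumed to be the Y coordinate, and the 10x10 block count is arbitrary.

```python
import numpy as np

# Made-up positions; the first two reuse the sample rows from the question
x = np.array([1333, 3999, 60000, 119000])
y = np.array([0, 7000, -13000, 26000])  # assumption: Y_coord_start

# 10 x 10 blocks spanning the full area given in the question
counts, xedges, yedges = np.histogram2d(
    x, y,
    bins=(10, 10),
    range=[[0, 119994], [-14000, 27000]],
)

# counts[i, j] is the number of people whose x falls in block i and y in block j
print(counts.shape, counts.sum())
```

The counts array can then be passed to imshow as counts.T with origin="lower", as in the code above.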
