Plotting multiple columns simultaneously in Python - python-3.x

I have a text file with several columns, and I am trying to make scatter plots from some of them. I made a list of the items (column names) that I want to plot. I would like a scatter plot of every item in the list against every other item.
Expected output:
If there are 3 columns to be plotted, I would like to get all of these plots:
1 vs 2
1 vs 3
2 vs 1
2 vs 3
3 vs 1
3 vs 2
To do so, I wrote the following code in Python:
import pandas as pd
import seaborn as sns
df = pd.read_csv('myfile.txt', sep="\t")
columns = list(df.columns.values)[3:] #to make a list of items
for i in len(columns):
    ax = sns.lmplot(x=columns[i], y=columns[i+1], data=df)
    ax.savefig(f'{columns[i]}.pdf')
But it does not produce the expected output. Do you know how to fix the code?
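One possible fix, sketched under some assumptions: `len(columns)` is an integer, so it cannot be iterated over, and `columns[i+1]` would only ever pair neighbouring columns. `itertools.permutations(columns, 2)` yields every ordered pair of distinct columns instead. The helper name `plot_all_pairs`, the `out_dir` parameter, the file naming scheme, and the use of a plain matplotlib scatter in place of `sns.lmplot` are illustrative choices, not part of the original post.

```python
import os
from itertools import permutations

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: figures are only written to disk
import matplotlib.pyplot as plt
import pandas as pd


def plot_all_pairs(df, columns, out_dir="."):
    """Save one scatter plot per ordered pair (x, y) of distinct columns."""
    saved = []
    for x_col, y_col in permutations(columns, 2):
        fig, ax = plt.subplots()
        ax.scatter(df[x_col], df[y_col])
        ax.set_xlabel(x_col)
        ax.set_ylabel(y_col)
        # hypothetical naming scheme: one file per pair, e.g. "a_vs_b.pdf"
        path = os.path.join(out_dir, f"{x_col}_vs_{y_col}.pdf")
        fig.savefig(path)
        plt.close(fig)  # free the figure so many pairs do not pile up in memory
        saved.append(path)
    return saved
```

With three columns this writes six files (1 vs 2, 1 vs 3, 2 vs 1, and so on). Naming each file after both columns avoids overwriting, which the original `f'{columns[i]}.pdf'` would do once a column appears on the x axis more than once.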

Related

Creating subplots through a loop from a dataframe

Case:
I receive a dataframe with (say 50) columns.
I extract the necessary columns from that dataframe using a condition.
So we have a list of selected columns of our dataframe now. (Say this variable is sel_cols)
I need a bar chart for each of these columns value_counts().
And I need to arrange all these bar charts in 3 columns, and varying number of rows based on number of columns selected in sel_cols.
So, if say 8 columns were selected, I want the figure to have 3 columns and 3 rows, with last subplot empty or just 8 subplots in 3x3 matrix if that is possible.
I could generate each chart separately using following code:
for col in sel_cols:
    df[col].value_counts().plot(kind='bar')
    plt.show()
I call plt.show() inside the loop so that each chart is shown and not just the last one.
I also tried appending these charts to a list this way:
charts = []
for col in sel_cols:
    charts.append(df[col].value_counts().plot(kind='bar'))
I could convert this list into a NumPy array and reshape() it, but then the number of charts has to fit the target shape exactly. So 8 chart objects cannot be reshaped into a 3x3 array.
Then I tried creating the subplots first in this way:
row = len(sel_cols)//3
fig, axes = plt.subplots(nrows=row,ncols=3)
This way I would get the subplots, but I get two problems:
I end up with extra subplots in the 3 columns which will go unplotted (8 columns example).
I do not know how to plot under each subplots through a loop.
I tried this:
for row in axes:
    for chart, col in zip(row, sel_cols):
        chart = data[col].value_counts().plot(kind='bar')
But this only plots the last subplot with the last column. All other subplots stay blank.
How to do this with minimal lines of code, possibly without any need for human verification of the final subplots placements?
You may use this sample dataframe:
pd.DataFrame({'A':['Y','N','N','Y','Y','N','N','Y','N'],
'B':['E','E','E','E','F','F','F','F','E'],
'C':[1,1,0,0,1,1,0,0,1],
'D':['P','Q','R','S','P','Q','R','P','Q'],
'E':['E','E','E','E','F','F','G','G','G'],
'F':[1,1,0,0,1,1,0,0,1],
'G':['N','N','N','N','Y','N','N','Y','N'],
'H':['G','G','G','E','F','F','G','F','E'],
'I':[1,1,0,0,1,1,0,0,1],
'J':['Y','N','N','Y','Y','N','N','Y','N'],
'K':['E','E','E','E','F','F','F','F','E'],
'L':[1,1,0,0,1,1,0,0,1],
})
Selected columns are: sel_cols = ['A','B','D','E','G','H','J','K']
Total 8 columns.
Expected output is bar charts for value_counts() of each of these columns arranged in subplots in a figure with 3 columns. Rows to be decided based on number of columns selected, here 8 so 3 rows.
Given OP's sample data:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'A':['Y','N','N','Y','Y','N','N','Y','N'],'B':['E','E','E','E','F','F','F','F','E'],'C':[1,1,0,0,1,1,0,0,1],'D':['P','Q','R','S','P','Q','R','P','Q'],'E':['E','E','E','E','F','F','G','G','G'],'F':[1,1,0,0,1,1,0,0,1],'G':['N','N','N','N','Y','N','N','Y','N'],'H':['G','G','G','E','F','F','G','F','E'],'I':[1,1,0,0,1,1,0,0,1],'J':['Y','N','N','Y','Y','N','N','Y','N'],'K':['E','E','E','E','F','F','F','F','E'],'L':[1,1,0,0,1,1,0,0,1]})
sel_cols = list('ABDEGHJK')
data = df[sel_cols].apply(pd.Series.value_counts)  # the top-level pd.value_counts is deprecated in recent pandas
We can plot the columns of data in several ways (in order of simplicity):
1. DataFrame.plot with subplots param
2. seaborn.catplot
3. Loop through plt.subplots
1. DataFrame.plot with subplots param
Set subplots=True with the desired layout dimensions. Unused subplots will be auto-disabled:
data.plot.bar(subplots=True, layout=(3, 3), figsize=(8, 6),
              sharex=False, sharey=True, legend=False)
plt.tight_layout()
2. seaborn.catplot
melt the data into long-form (i.e., 1 variable per column, 1 observation per row) and pass it to seaborn.catplot:
import seaborn as sns
melted = data.melt(var_name='var', value_name='count', ignore_index=False).reset_index()
sns.catplot(data=melted, kind='bar', x='index', y='count',
            col='var', col_wrap=3, sharex=False)
3. Loop through plt.subplots
zip the columns and axes to iterate in pairs. Use the ax param to place each column onto its corresponding subplot.
If the grid size is larger than the number of columns (e.g., 3*3 > 8), disable the leftover axes with set_axis_off:
fig, axes = plt.subplots(3, 3, figsize=(8, 8), constrained_layout=True, sharey=True)
# plot each col onto one ax
for col, ax in zip(data.columns, axes.flat):
    data[col].plot.bar(ax=ax, rot=0)
    ax.set_title(col)
# disable leftover axes
for ax in axes.flat[data.columns.size:]:
    ax.set_axis_off()
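The grid above is hard-coded to 3x3. A minimal sketch that derives the number of rows from the number of selected columns, assuming a fixed width of three subplots, might look like this. The helper name `plot_value_counts_grid` and its `ncols` parameter are made up for illustration, and the function counts values itself rather than using the precomputed `data` frame.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open windows
import matplotlib.pyplot as plt
import pandas as pd


def plot_value_counts_grid(df, cols, ncols=3):
    """Draw one value_counts() bar chart per column on a grid `ncols` wide."""
    nrows = math.ceil(len(cols) / ncols)  # round up, e.g. 8 columns -> 3 rows
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                             squeeze=False, constrained_layout=True)
    # pair each column with the next free subplot
    for col, ax in zip(cols, axes.flat):
        df[col].value_counts().plot.bar(ax=ax, rot=0)
        ax.set_title(col)
    # hide the subplots that received no column
    for ax in axes.flat[len(cols):]:
        ax.set_axis_off()
    return fig, axes
```

Rounding up with `math.ceil` is what the question's `len(sel_cols)//3` was missing: floor division gives 2 rows for 8 columns, which leaves no room for the last two charts. `squeeze=False` keeps `axes` two-dimensional even when only one row is needed.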
As an alternative to the answer by tdy, I tried to do it without seaborn, using Matplotlib and a for loop.
This may suit those who want specific control over each subplot's formatting and other parameters:
fig = plt.figure(1, figsize=(16, 12))
for i, col in enumerate(sel_cols, 1):
    fig.add_subplot(3, 4, i)  # activate position i on a 3x4 grid
    # here `data` is the original DataFrame, as in the question
    data[col].value_counts().plot(kind='bar', ax=plt.gca())
    plt.title(col)
plt.tight_layout()
plt.show()
fig.add_subplot (like plt.subplot) activates a subplot, while plt.gca() points to the active subplot.
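A small sketch of the "active subplot" idea, assuming the pyplot state-machine interface. The variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, nothing is displayed
import matplotlib.pyplot as plt

fig = plt.figure()
first = plt.subplot(1, 2, 1)    # creates the left axes and makes it current
second = plt.subplot(1, 2, 2)   # creates the right axes and makes it current
current_after_second = plt.gca()

plt.sca(first)                  # explicitly switch back to the left axes
current_after_sca = plt.gca()
```

After the second `plt.subplot` call, `plt.gca()` returns the right-hand axes. After `plt.sca(first)`, it returns the left-hand one. Passing `ax=` explicitly, as in the loop above, avoids depending on this hidden state.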

How do I transpose a Dataframe and how to scatter plot the transposed df

I have this dataframe with 20 countries and 20 years of data
Country 2000 2001 2002 ...
USA 1 2 3
CANADA 4 5 6
SWEDEN 7 8 9
...
and I want to get a new df to create a scatter plot with y = value for each column (country) and x= Year
Country USA CANADA SWEDEN ...
2000 1 4 7
2001 2 5 8
2002 3 6 9
...
My Code :
data = pd.read_csv("data.csv")
data.set_index("Country Name", inplace = True)
data_transposed = data.T
I'm struggling to create this kind of scatter plot.
Any idea ?
Thanks
Scatter is a plot which receives x and y only, so you cannot scatter the whole dataframe directly. However, here is a small workaround:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data={"Country":["USA", "Canada", "Brasil"], 2000:[1,4,7], 2001:[3,7,9], 2002: [2,8,5]})
for column in df.columns:
    if column != "Country":
        plt.scatter(x=df["Country"], y=df[column])
plt.show()
This simply plots each column separately, and eventually you get what you want.
As you can see, each year is represented by a different color. You can do the opposite (plot years on the x axis and give each country a different color). A scatter plot is two-dimensional, while you have three variables: Country, Year, and Value. You can present only two of them as axes in a scatter plot, unless you encode the third with color, for example.
You do not need to transpose your dataframe for that (since you specify yourself what x and y are), but you can do it with df.transpose(): see the documentation.
Notice that in my df, the country column is not an index. You can use set_index or reset_index to control that.
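For the layout the question asks for (x = year, one set of points per country), a minimal sketch could look like the following. It assumes the year headers arrive as strings, as they would from `read_csv`, and the sample values are the ones shown in the question. The names `wide` and `by_year` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display the figure
import matplotlib.pyplot as plt
import pandas as pd

wide = pd.DataFrame({
    "Country": ["USA", "CANADA", "SWEDEN"],
    "2000": [1, 4, 7],
    "2001": [2, 5, 8],
    "2002": [3, 6, 9],
})

# rows become years, columns become countries
by_year = wide.set_index("Country").T
by_year.index = by_year.index.astype(int)  # year labels -> integers for the x axis

fig, ax = plt.subplots()
for country in by_year.columns:
    # one scatter call per country, so each gets its own color and legend entry
    ax.scatter(by_year.index, by_year[country], label=country)
ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.legend()
```

Calling `ax.scatter` once per country lets matplotlib cycle through its default colors, so no colormap has to be chosen by hand.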

How to create a scatter plot where values are across multiple columns?

I have a dataframe in Pandas in which the rows are observations at different times and each column is a size bin where the values represent the number of particles observed for that size bin. So it looks like the following:
bin1 bin2 bin3 bin4 bin5
Time1 50 200 30 40 5
Time2 60 60 40 420 700
Time3 34 200 30 67 43
I would like to use plotly/cufflinks to create a scatterplot in which the x axis will be each size bin, and the y axis will be the values in each size bin. There will be three colors, one for each observation.
As I'm more experienced in Matlab, I tried indexing the values using iloc (note the example below is just trying to plot one observation):
df.iplot(kind="scatter",theme="white",x=df.columns, y=df.iloc[1,:])
But I just get a key error: 0 message.
Is it possible to use indexing when choosing x and y values in Pandas?
Rather than indexing, I think you need to better understand how pandas and matplotlib interact with each other.
Let's go step by step for your case:
1. As the pandas.DataFrame.plot documentation says, each plotted series is a column. Your series are in the rows, so you need to transpose your dataframe.
2. To create a scatter plot, you need the x and y coordinates in different columns. You are missing the x column, so you also need to create a column with the x values in the transposed dataframe.
3. Apparently pandas does not change color by default on consecutive calls to plot (matplotlib does), so you need to pick a colormap and pass a color argument; otherwise all points will have the same color.
Here is a working example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Here I copied your data into a data.txt text file and imported it into pandas as a csv.
# You may have a different way to get your data.
df = pd.read_csv('data.txt', sep=r'\s+', engine='python')
# I assume there is a column named 'time', which is set as the index, as you show in your post.
df = df.set_index('time')  # set_index returns a new dataframe, so keep the result
tdf = df.transpose()  # transpose the dataframe: each observation becomes a column
# Create the x values; I go from 1 to 5 but they can be different.
tdf['xval'] = np.arange(1, len(tdf) + 1)
# Choose a colormap and make a list of colors to be used.
colormap = plt.cm.rainbow
colors = [colormap(i) for i in np.linspace(0, 1, len(tdf))]
# Make an empty plot; the columns will be added to the axes in the loop.
fig, axes = plt.subplots(1, 1)
for i, cl in enumerate([datacol for datacol in tdf.columns if datacol != 'xval']):
    tdf.plot(x='xval', y=cl, kind="scatter", ax=axes, color=colors[i])
plt.show()
This plots the following image:
Here is a tutorial on picking colors in matplotlib.

labeling data points with dataframe including empty cells

I have an Excel sheet like this:
A B C D
3 1 2 8
4 2 2 8
5 3 2 9
2 9
6 4 2 7
Now I am trying to plot 'B' over 'C' and label the data points with the entries of 'A'. It should show me the points 1/2, 2/2, 3/2 and 4/2 with the corresponding labels.
import matplotlib.pyplot as plt
import pandas as pd
import os
df = pd.read_excel(os.path.join(os.path.dirname(__file__), "./Datenbank/Test.xlsx"))
fig, ax = plt.subplots()
df.plot('B', 'C', kind='scatter', ax=ax)
df[['B','C','A']].apply(lambda x: ax.text(*x),axis=1);
plt.show()
Unfortunately I am getting this:
with the Error:
ValueError: posx and posy should be finite values
As you can see, it did not label the last data point. I know it is because of the empty cells in the sheet, but I cannot avoid them. There is just no measurement data at these positions.
I already searched for a solution here:
Annotate data points while plotting from Pandas DataFrame
but it did not solve my problem.
So, is there a way to still label the last data point?
P.S.: The excel sheet is just an example. So keep in mind in reality there are many empty cells at different positions.
You can simply drop the invalid data rows from df before plotting them:
df = df[df['B'].notnull()]
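The filter above checks only column 'B'. Since the real sheet has empty cells in different positions, a slightly more general sketch is to drop every row in which any of the plotted or label columns is missing, using `DataFrame.dropna(subset=...)`. The small frame below is a stand-in for the Excel file, and it assumes the empty cells in the example are in columns 'A' and 'B'.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display the figure
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# stand-in for the Excel sheet; NaN marks the empty cells (assumed positions)
df = pd.DataFrame({
    "A": [3, 4, 5, np.nan, 6],
    "B": [1, 2, 3, np.nan, 4],
    "C": [2, 2, 2, 2, 2],
    "D": [8, 8, 9, 9, 7],
})

# keep only rows where both coordinates and the label are present
valid = df.dropna(subset=["A", "B", "C"])

fig, ax = plt.subplots()
valid.plot("B", "C", kind="scatter", ax=ax)
for x, y, label in zip(valid["B"], valid["C"], valid["A"]):
    ax.annotate(f"{label:g}", (x, y))  # ":g" prints 6.0 as "6"
```

Because the incomplete row is removed before any text is placed, `ax.text` or `ax.annotate` never receives a non-finite position, and the last point (4/2) gets its label.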

How can I plot a group by boxplot in pandas dropping unused categories?

To make a long story short: how can I plot a grouped boxplot from a category column in pandas and show only the categories present in the subset instead of all possible categories?
[reproducible example]
I have a pandas dataframe with a factor column, and I want to plot a boxplot. If I plot by the factor, it is OK. If I take a subset and plot the boxplots by the factor, it is also OK, and only the factors present in the subset are plotted. But if I have set the column as a category, then all categories are plotted in the boxplot even if they are not present.
- Create the dataframe
import pandas as pd
import numpy as np
x = ['A']*150 + ['B']*150 + ['C']*150 + ['D']*150 + ['E']*150 + ['F']*150
y = np.random.randn(900)
z = ['X']*450 + ['Y']*450
df = pd.DataFrame({'Letter':x, 'N':y, 'type':z})
print(df.head())
print(df.tail())
- Plot by factor
df.boxplot(by='Letter')
- Plot a subset (only categories in the subset are plotted, but sorted alphabetically, not in the wanted order)
df[df['type']=='X'].boxplot(by='Letter')
- Convert the factor to a category and plot the subset, in order to have the set ordered: all categories are plotted even if they are missing from the subset. The good part is that they are in "wanted_sort_order".
df['Letter2'] = df['Letter'].copy()
df['Letter2'] = df['Letter2'].astype('category')
# set a category in order to sort the factor in specific order
df['Letter2'] = df['Letter2'].cat.set_categories(df['Letter2'].drop_duplicates().tolist()[::-1])  # the inplace argument is gone in recent pandas
df[df['type']=='X'].boxplot(by='Letter2')
After creating the DataFrame (first code block), try the following:
df['Letter2'] = pd.Categorical(df['Letter'], list('BAC'))
df[df['type']=='X'].boxplot(by='Letter2')
Result:
What pd.Categorical does here is simply set to NaN whatever is not in your category list (the second parameter); .boxplot() naturally ignores those values and plots only the categories you are looking for.
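A different route, sketched here as an alternative rather than as the answer's method: keep the full ordered category list on the original frame and call `Series.cat.remove_unused_categories()` on the subset. The categories that do not occur in the subset are dropped, while the remaining ones keep their custom order. The small frame and the `wanted_order` list are illustrative.

```python
import numpy as np
import pandas as pd

letters = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
kinds = ["X"] * 6 + ["Y"] * 6
df = pd.DataFrame({"Letter": letters, "N": np.arange(12.0), "type": kinds})

# categorical with an explicit (here: reversed) order
wanted_order = ["D", "C", "B", "A"]
df["Letter2"] = pd.Categorical(df["Letter"], categories=wanted_order, ordered=True)

subset = df[df["type"] == "X"].copy()
# drop the categories that no longer occur in the subset; the order is preserved
subset["Letter2"] = subset["Letter2"].cat.remove_unused_categories()

# subset.boxplot(column="N", by="Letter2") now has only B and A to group by
```

Unlike listing the wanted categories by hand, this does not require knowing in advance which letters survive the filter.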
