How do I transpose a Dataframe and how to scatter plot the transposed df - python-3.x

I have this dataframe with 20 countries and 20 years of data
Country 2000 2001 2002 ...
USA 1 2 3
CANADA 4 5 6
SWEDEN 7 8 9
...
and I want to get a new df to create a scatter plot with y = value for each column (country) and x= Year
Country USA CANADA SWEDEN ...
2000 1 4 7
2001 2 5 8
2002 3 6 9
...
My Code :
data = pd.read_csv("data.csv")
data.set_index("Country Name", inplace = True)
data_transposed = data.T
I'm struggling to create this kind of scatter plot.
Any idea ?
Thanks

A scatter plot takes only x and y, so you can't scatter the whole dataframe directly. However, here is a small workaround:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data={"Country":["USA", "Canada", "Brasil"], 2000:[1,4,7], 2001:[3,7,9], 2002: [2,8,5]})
for column in df.columns:
    if column != "Country":
        plt.scatter(x=df["Country"], y=df[column])
plt.show()
The result simply plots each column separately, which eventually gives you what you want.
Each year is represented by a different color; you can also do the opposite (plot the years on the x-axis and give each country its own color). A scatter plot has two axes, but you have three variables: Country, Year, Value. You can present only two of them on the axes (unless you encode the third with colors, for example).
You don't need to transpose your dataframe for that (since you specify yourself what x and y are), but you can do it with df.transpose(): see the documentation.
Note that in my df, the Country column is not the index. You can use set_index or reset_index to control that.
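For the layout the question asks for (Year on the x-axis, one color per country), a minimal sketch could look like the following. The sample values are taken from the question's example table, and the axis labels are assumptions chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data in the question's wide layout (one row per country)
df = pd.DataFrame({
    "Country": ["USA", "CANADA", "SWEDEN"],
    2000: [1, 4, 7],
    2001: [2, 5, 8],
    2002: [3, 6, 9],
})

# Move the country names into the index, then transpose:
# years become the rows and countries become the columns
transposed = df.set_index("Country").T
# Make sure the years are numeric (after read_csv they are usually strings)
transposed.index = transposed.index.astype(int)
transposed.index.name = "Year"

# One scatter call per country, so each country gets its own color
fig, ax = plt.subplots()
for country in transposed.columns:
    ax.scatter(transposed.index, transposed[country], label=country)
ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.legend()
plt.show()
```

If the data comes from a CSV file, the same steps should apply after read_csv, as long as the name passed to set_index matches the actual header of the country column.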

Related

How to plot 2 values in pandas [duplicate]

I have dataframe total_year, which contains three columns (year, action, comedy).
How can I plot two columns (action and comedy) on y-axis?
My code plots only one:
total_year[-15:].plot(x='year', y='action', figsize=(10,5), grid=True)
Several column names may be provided to the y argument of the pandas plotting function. Those should be specified in a list, as follows.
df.plot(x="year", y=["action", "comedy"])
Complete example:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"year": [1914,1915,1916,1919,1920],
"action" : [2.6,3.4,3.25,2.8,1.75],
"comedy" : [2.5,2.9,3.0,3.3,3.4] })
df.plot(x="year", y=["action", "comedy"])
plt.show()
By default, pandas.DataFrame.plot() uses the index for the x-axis, and all other numeric columns are used as y values.
So setting year column as index will do the trick:
total_year.set_index('year').plot(figsize=(10,5), grid=True)
When using pandas.DataFrame.plot, it's only necessary to specify a column to the x parameter.
The caveat is, the rest of the columns with numeric values will be used for y.
The following code contains extra columns to demonstrate. Note, 'date' is left as a string. However, if 'date' is converted to a datetime dtype, the plot API will also plot the 'date' column on the y-axis.
If the dataframe includes many columns, some of which should not be plotted, then specify the y parameter as shown in this answer, but if the dataframe contains only columns to be plotted, then specify only the x parameter.
In cases where the index is to be used as the x-axis, then it is not necessary to specify x=.
import pandas as pd
# test data
data = {'year': [1914, 1915, 1916, 1919, 1920],
'action': [2.67, 3.43, 3.26, 2.82, 1.75],
'comedy': [2.53, 2.93, 3.02, 3.37, 3.45],
'test1': ['a', 'b', 'c', 'd', 'e'],
'date': ['1914-01-01', '1915-01-01', '1916-01-01', '1919-01-01', '1920-01-01']}
# create the dataframe
df = pd.DataFrame(data)
# display(df)
year action comedy test1 date
0 1914 2.67 2.53 a 1914-01-01
1 1915 3.43 2.93 b 1915-01-01
2 1916 3.26 3.02 c 1916-01-01
3 1919 2.82 3.37 d 1919-01-01
4 1920 1.75 3.45 e 1920-01-01
# plot the dataframe
df.plot(x='year', figsize=(10, 5), grid=True)

Making rows of points in dataframe into POLYGON using geopandas

I am trying to find out if points are in closed polygons (in this question: Finding if a point in a dataframe is in a polygon and assigning polygon name to point), but I realized that there might be another way to do this:
I have this dataframe
df=
id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
and I would like to transform this into
id Polygon
0 A1 POLYGON((65.422080, 48.147850), (46.635708, 51.165745), (46.597984, 47.657444), (68.477700, 44.073700))
1 A3 POLYGON((46.635708,54.108190), (46.635708 ,51.844770), (63.309560, 48.826878),(62.215572 , 54.108190))
and do the same for points:
df1=
item x y
0 1 50 49
1 2 60 53
2 3 70 30
to
item point
0 1 POINT(50,49)
1 2 POINT(60,53)
2 3 POINT(70,30)
I have never used geopandas and am a little at a loss here.
My question is thus: How do I get from a pandas dataframe to a dataframe with geopandas attributes?
Thankful for any insight!
You can achieve this as follows, but you would have to set the right dtype. I know that in ArcGIS you have to set the dtype as geometry:
df.groupby('id').apply(lambda x: 'POLYGON(' + str(tuple(zip(x['x_zone'],x['y_zone'])))+')')
I'd suggest the following to directly get a GeoDataFrame from your df:
from shapely.geometry import Polygon
import geopandas as gpd
# group by the 'id' column from the question's df (the original snippet used 'name')
gdf = gpd.GeoDataFrame(geometry=df.groupby('id').apply(
    lambda g: Polygon(gpd.points_from_xy(g['x_zone'], g['y_zone']))))
It first creates a list of points using geopandas' points_from_xy, then creates a Polygon object from this list.
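The question also asks about the points dataframe, which the snippet above does not cover. A minimal sketch, assuming the column names item, x and y from the question's df1, is to pass the result of points_from_xy as the geometry column:

```python
import pandas as pd
import geopandas as gpd

# Sample data copied from the question's df1
df1 = pd.DataFrame({"item": [1, 2, 3], "x": [50, 60, 70], "y": [49, 53, 30]})

# points_from_xy builds one shapely Point per row from the two coordinate columns
points_gdf = gpd.GeoDataFrame(
    df1[["item"]],
    geometry=gpd.points_from_xy(df1["x"], df1["y"]),
)
print(points_gdf)
```

No coordinate reference system is set here; whether one is needed depends on what the zone coordinates represent.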

Plot from csv with panda grouping

If I have a CSV with 4 columns, how can I average the values of one column (x) over the average of another column (y), grouping by the first column, with pandas? Do I have to loop over every value of the first column? I am not sure about the implementation.
For example, if I have a csv file:
a,1,2,4
a,2,2,5
a,3,2,6
a,4,2,5
b,1,3,2
b,2,3,3
b,3,3,4
and I want a plot with a,average(3rd column) and b,average(3rd column)
I have to do something like:
df=pd.read_csv
x=group_by("values of the 1st column").average()
I would also try to plot kde over the 2nd column, which has ten rows for every group of the first column.
In particular, I don't understand how to group data from a CSV file without a header.
Thank you for the help.
Assume your dataframe looks like
print(df)
0 1 2 3
0 a 1 2 4
1 a 2 2 5
2 a 3 2 6
3 a 4 2 5
4 b 1 3 2
5 b 2 3 3
6 b 3 3 4
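If you want a bar plot of the per-group average (one bar for a, one for b), you can do the following. Note that the column label 3 selects the fourth column of the file; use 2 for the third column.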
import pandas as pd
import matplotlib.pyplot as plt
df.groupby(0).mean()[3].plot.bar(rot=0)
plt.show()
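For the header-less file itself, a minimal sketch is to pass header=None to read_csv, so that pandas labels the columns 0 to 3, and then group on column 0. The inline string below stands in for the real file, whose path is not given in the question.

```python
import io
import pandas as pd

# Inline copy of the sample rows from the question (stand-in for the real file)
csv_text = """a,1,2,4
a,2,2,5
a,3,2,6
a,4,2,5
b,1,3,2
b,2,3,3
b,3,3,4
"""

# header=None tells pandas the first row is data; columns are labelled 0, 1, 2, 3
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Mean of every numeric column within each group of column 0
group_means = df.groupby(0).mean()

# Label 2 is the third column of the file
third_col_mean = group_means[2]
print(third_col_mean)
```

No explicit loop over the group values is needed; groupby handles each group internally.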

plotting multiple columns simultaneously in Python

I have a text file with some columns. I am trying to make scatter plot from some of the columns in my file. I made a list of the items (column names) that I want to make a plot for. I would like to make the scatter plot for all items in the list against other items.
expected output:
if there are 3 columns to be plotted, I would like to get these plots simultaneously:
1 vs 2
1 vs 3
2 vs 1
2 vs 3
3 vs 1
3 vs 2
to do so I made the following code in python:
import pandas as pd
import seaborn as sns
df = pd.read_csv('myfile.txt', sep="\t")
columns = list(df.columns.values)[3:] #to make a list of items
for i in len(columns):
    ax = sns.lmplot(x=columns[i], y=columns[i+1], data=df)
    ax.savefig(f'{columns[i]}.pdf')
But it does not return the expected outputs. Do you know how to fix the code?
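One possible approach, offered as a sketch rather than a definitive fix: len(columns) is an integer and cannot be iterated, and pairing columns[i] with columns[i+1] would only cover neighbouring columns anyway. itertools.permutations yields every ordered pair of distinct columns. The dataframe and the column names c1, c2, c3 below are hypothetical, and plain matplotlib scatter plots are used in place of sns.lmplot to keep the example small.

```python
from itertools import permutations

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the tab-separated file in the question
df = pd.DataFrame({
    "c1": [1, 2, 3, 4],
    "c2": [2, 4, 5, 8],
    "c3": [9, 7, 4, 1],
})
columns = list(df.columns)

# Every ordered pair of distinct columns: (c1, c2), (c1, c3), (c2, c1), ...
pairs = list(permutations(columns, 2))

# One subplot per pair, laid out in a single row
fig, axes = plt.subplots(1, len(pairs), figsize=(3 * len(pairs), 3), squeeze=False)
for ax, (x_col, y_col) in zip(axes.flat, pairs):
    ax.scatter(df[x_col], df[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
fig.tight_layout()
plt.show()
```

The same loop over pairs should work with sns.lmplot(x=x_col, y=y_col, data=df), saving each figure under a name that includes both column names so the files do not overwrite each other.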

labeling data points with dataframe including empty cells

I have an Excel sheet like this:
A B C D
3 1 2 8
4 2 2 8
5 3 2 9
2 9
6 4 2 7
Now I am trying to plot 'B' over 'C' and label the data points with the entries of 'A'. It should show me the points 1/2, 2/2, 3/2 and 4/2 with the corresponding labels.
import matplotlib.pyplot as plt
import pandas as pd
import os
df = pd.read_excel(os.path.join(os.path.dirname(__file__), "./Datenbank/Test.xlsx"))
fig, ax = plt.subplots()
df.plot('B', 'C', kind='scatter', ax=ax)
df[['B','C','A']].apply(lambda x: ax.text(*x),axis=1);
plt.show()
Unfortunately, I am getting this error:
ValueError: posx and posy should be finite values
As you can see, it did not label the last data point. I know it is because of the empty cells in the sheet, but I cannot avoid them. There is just no measurement data at these positions.
I already searched for a solution here:
Annotate data points while plotting from Pandas DataFrame
but it did not solve my problem.
So, is there a way to still label the last data point?
P.S.: The excel sheet is just an example. So keep in mind in reality there are many empty cells at different positions.
You can simply drop the rows with invalid data from df before plotting them:
df = df[df['B'].notnull()]
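Since the real sheet has empty cells in several positions, a slightly more general sketch is to drop every row that lacks any of the three values needed for a label (x, y and the label text). The dataframe below rebuilds the question's example by hand, with NaN standing in for the empty Excel cells.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The question's sheet, with NaN where the Excel cells are empty
df = pd.DataFrame({
    "A": [3, 4, 5, np.nan, 6],
    "B": [1, 2, 3, np.nan, 4],
    "C": [2, 2, 2, 2, 2],
    "D": [8, 8, 9, 9, 7],
})

# Keep only rows where x (B), y (C) and the label (A) are all present
clean = df.dropna(subset=["A", "B", "C"])

fig, ax = plt.subplots()
clean.plot("B", "C", kind="scatter", ax=ax)
# Column A became float because of the NaN, so cast back to int for the label
for x, y, label in zip(clean["B"], clean["C"], clean["A"]):
    ax.text(x, y, str(int(label)))
plt.show()
```

Rows with a gap only in column D are kept, because D is not used for the plot or the labels.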
