How to plot 2 values in pandas [duplicate] - python-3.x

I have dataframe total_year, which contains three columns (year, action, comedy).
How can I plot two columns (action and comedy) on y-axis?
My code plots only one:
total_year[-15:].plot(x='year', y='action', figsize=(10,5), grid=True)

Several column names may be provided to the y argument of the pandas plotting function. Those should be specified in a list, as follows.
df.plot(x="year", y=["action", "comedy"])
Complete example:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"year": [1914,1915,1916,1919,1920],
"action" : [2.6,3.4,3.25,2.8,1.75],
"comedy" : [2.5,2.9,3.0,3.3,3.4] })
df.plot(x="year", y=["action", "comedy"])
plt.show()

By default, pandas.DataFrame.plot() uses the index for the x-axis, and all numeric columns are plotted as y values.
So setting year column as index will do the trick:
total_year.set_index('year').plot(figsize=(10,5), grid=True)

When using pandas.DataFrame.plot, it's only necessary to specify a column to the x parameter.
The caveat is, the rest of the columns with numeric values will be used for y.
The following code contains extra columns to demonstrate. Note, 'date' is left as a string. However, if 'date' is converted to a datetime dtype, the plot API will also plot the 'date' column on the y-axis.
If the dataframe includes many columns, some of which should not be plotted, then specify the y parameter as shown in this answer, but if the dataframe contains only columns to be plotted, then specify only the x parameter.
In cases where the index is to be used as the x-axis, then it is not necessary to specify x=.
import pandas as pd
# test data
data = {'year': [1914, 1915, 1916, 1919, 1920],
'action': [2.67, 3.43, 3.26, 2.82, 1.75],
'comedy': [2.53, 2.93, 3.02, 3.37, 3.45],
'test1': ['a', 'b', 'c', 'd', 'e'],
'date': ['1914-01-01', '1915-01-01', '1916-01-01', '1919-01-01', '1920-01-01']}
# create the dataframe
df = pd.DataFrame(data)
# display(df)
year action comedy test1 date
0 1914 2.67 2.53 a 1914-01-01
1 1915 3.43 2.93 b 1915-01-01
2 1916 3.26 3.02 c 1916-01-01
3 1919 2.82 3.37 d 1919-01-01
4 1920 1.75 3.45 e 1920-01-01
# plot the dataframe
df.plot(x='year', figsize=(10, 5), grid=True)
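If you would rather be explicit about what ends up on the y-axis, one option is to build the list of y columns yourself. The following is only a rough sketch: selecting the numeric columns with select_dtypes is my own approximation of what the plot API does by default, and the name y_cols is just illustrative.

```python
import pandas as pd

# same test data as above, without the 'date' column
df = pd.DataFrame({
    "year": [1914, 1915, 1916, 1919, 1920],
    "action": [2.67, 3.43, 3.26, 2.82, 1.75],
    "comedy": [2.53, 2.93, 3.02, 3.37, 3.45],
    "test1": ["a", "b", "c", "d", "e"],
})

# assumption: the columns to plot are the numeric ones other than the x column
y_cols = df.drop(columns="year").select_dtypes(include="number").columns.tolist()
print(y_cols)

# the list can then be passed to y explicitly:
# df.plot(x="year", y=y_cols, figsize=(10, 5), grid=True)
```

Passing the list to y keeps non-numeric columns such as 'test1' out of the plot regardless of the defaults.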

Related

Apply a custom rolling function with arguments on Pandas DataFrame

I have this df (here is the df.head()):
date colA
0 2018-01-05 0.6191
1 2018-01-20 0.5645
2 2018-01-25 0.5641
3 2018-01-27 0.5404
4 2018-01-30 0.4933
I would like to apply a function to every 3 rows recursively, meaning for rows: 1,2,3 then for rows: 2,3,4 then rows 3,4,5, etc.
This is what I wrote:
def my_rolling_func(df, val):
    p1 = (df['date']-df['date'].min()).dt.days.tolist()[0],df[val].tolist()[0]
    p2 = (df['date']-df['date'].min()).dt.days.tolist()[1],df[val].tolist()[1]
    p3 = (df['date']-df['date'].min()).dt.days.tolist()[2],df[val].tolist()[2]
    return sum([i*j for i,j in [p1,p2,p3]])
df.rolling(3,center=False,axis=1).apply(my_rolling_func, args=('colA'))
But I get this error:
ValueError: Length of passed values is 1, index implies 494.
494 is the number of rows in my df.
I'm not sure why it says I passed a length of 1. I thought rolling generates slices of df according to the window size I defined (3) and then applies the function to each of those slices.
First, you specified the wrong axis. Axis 1 means that the window will slide along the columns. You want the window to slide along the indexes, so you need to specify axis=0. Secondly, you misunderstand a little about how rolling works. It will apply your function to each column independently, so you cannot operate on both the date and colA columns at the same time inside your function.
I rewrote your code to make it work:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date':pd.date_range('2018-01-05', '2018-01-30', freq='D'), 'A': np.random.random((26,))})
df = df.set_index('date')
def my_rolling_func(s):
    # s is one window of column A, indexed by date
    days = (s.index - s.index[0]).days
    return sum(s*days)
res = df.rolling(3, center=False, axis=0).apply(my_rolling_func)
print(res)
Out:
A
date
2018-01-05 NaN
2018-01-06 NaN
2018-01-07 1.123872
2018-01-08 1.121119
2018-01-09 1.782860
2018-01-10 0.900717
2018-01-11 0.999509
2018-01-12 1.755408
2018-01-13 2.344914
.....
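The same idea also works with the irregular dates from the question. This is only a sketch: it assumes the rows are already sorted by date, it leaves out the axis argument (which I believe newer pandas versions deprecate), and weighted_by_days is a name I made up.

```python
import numpy as np
import pandas as pd

# the five rows shown in the question
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-01-25",
                            "2018-01-27", "2018-01-30"]),
    "colA": [0.6191, 0.5645, 0.5641, 0.5404, 0.4933],
})

def weighted_by_days(window):
    # window is a Series holding one 3-row slice, indexed by date
    days = (window.index - window.index[0]).days.to_numpy()
    # sum of (days since the first date in the window) * value
    return float(np.dot(days, window.to_numpy()))

# raw=False passes each window as a Series, so the date index is available
res = df.set_index("date")["colA"].rolling(3).apply(weighted_by_days, raw=False)
print(res)
```

The first two rows are NaN because their windows hold fewer than three values.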

convert datetime to date python --> error: unhashable type: 'numpy.ndarray'

Pandas by default represents dates as datetime64[ns], so my columns have the format [2016-02-05 00:00:00], but I just want the date 2016-02-05, so I applied this code to a few columns:
df3a['MA'] = pd.to_datetime(df3a['MA'])
df3a['BA'] = pd.to_datetime(df3a['BA'])
df3a['FF'] = pd.to_datetime(df3a['FF'])
df3a['JJ'] = pd.to_datetime(df3a['JJ'])
.....
but it gives me this error: TypeError: unhashable type: 'numpy.ndarray'
My question is: why do I get this error, and how do I convert datetime to date for multiple columns (around 50)?
I would be grateful for your help.
One way to achieve what you'd like is with a DatetimeIndex. I've first created an Example DataFrame with 'date' and 'values' columns and tried from there on to reproduce the error you've got.
import pandas as pd
import numpy as np
# Example DataFrame with a DatetimeIndex (dti)
dti = pd.date_range('2020-12-01','2020-12-17') # dates from first of december up to date
values = np.random.choice(range(1, 101), len(dti)) # random values between 1 and 100
df = pd.DataFrame({'date':dti,'values':values}, index=range(len(dti)))
print(df.head())
>>> date values
0 2020-12-01 85
1 2020-12-02 100
2 2020-12-03 96
3 2020-12-04 40
4 2020-12-05 27
In the example, the 'date' column already shows only the dates without the time. As far as I know, this is because pandas hides the time part in the display when every value is at midnight; the dtype is still datetime64[ns].
What I haven't tested but might work for you is:
# Your dataframe
df3a['MA'] = pd.DatetimeIndex(df3a['MA'])
...
# automated transform for all columns (if all columns are datetimes!)
for label in df3a.columns:
    df3a[label] = pd.DatetimeIndex(df3a[label])
Use DataFrame.apply:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)
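If the goal is to keep only the calendar date, the .dt.date accessor can be chained onto the conversion. This is a sketch with made-up sample values; only the column names 'MA' and 'BA' come from the question. Note that the resulting columns have object dtype (plain datetime.date values), not datetime64[ns].

```python
import pandas as pd

# made-up sample data in the format described in the question
df3a = pd.DataFrame({
    "MA": ["2016-02-05 00:00:00", "2016-03-01 00:00:00"],
    "BA": ["2017-07-14 00:00:00", "2017-08-02 00:00:00"],
})

cols = ["MA", "BA"]
# parse each column, then keep only the date part
df3a[cols] = df3a[cols].apply(lambda s: pd.to_datetime(s).dt.date)
print(df3a)
```

For around 50 columns, cols can be any list of column labels, for example list(df3a.columns) if every column holds dates.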

How to declare range based grouping in pd.Dataframe? [duplicate]

Is there an easy method in pandas to invoke groupby on a range of value increments? For instance, given the example below, can I bin and group column B with a 0.155 increment so that, for example, the first couple of groups in column B are divided into the ranges '0 - 0.155, 0.155 - 0.31 ...'?
import numpy as np
import pandas as pd
df=pd.DataFrame({'A':np.random.random(20),'B':np.random.random(20)})
A B
0 0.383493 0.250785
1 0.572949 0.139555
2 0.652391 0.401983
3 0.214145 0.696935
4 0.848551 0.516692
Alternatively I could first categorize the data by those increments into a new column and subsequently use groupby to determine any relevant statistics that may be applicable in column A?
You might be interested in pd.cut:
>>> df.groupby(pd.cut(df["B"], np.arange(0, 1.0+0.155, 0.155))).sum()
A B
B
(0, 0.155] 2.775458 0.246394
(0.155, 0.31] 1.123989 0.471618
(0.31, 0.465] 2.051814 1.882763
(0.465, 0.62] 2.277960 1.528492
(0.62, 0.775] 1.577419 2.810723
(0.775, 0.93] 0.535100 1.694955
(0.93, 1.085] NaN NaN
[7 rows x 2 columns]
Try this:
df = df.sort_values('B')
bins = np.arange(0, 1.0, 0.155)
ind = np.digitize(df['B'], bins)
print(df.groupby(ind).head())
Of course, you can use any function on the groups, not just head.
This is how I bin the values with pd.cut, so that the new column can be used with groupby:
df1=data
bins = [0,40,50,60,70,100]
group_names=['F','S','C','B','A']
df1['grade']=pd.cut(data['student_mark'],bins,labels=group_names)
df1
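To get from the labelled bins to per-group statistics, the new column can be passed to groupby. A small sketch with invented marks; the bins and labels are the ones above, everything else is an assumption for illustration.

```python
import pandas as pd

# invented marks, one or two per grade band
data = pd.DataFrame({"student_mark": [35, 45, 55, 65, 85, 95]})

bins = [0, 40, 50, 60, 70, 100]
group_names = ["F", "S", "C", "B", "A"]

df1 = data.copy()  # copy so the original frame is left untouched
df1["grade"] = pd.cut(df1["student_mark"], bins, labels=group_names)

# mean mark per grade; observed=False keeps grades that have no rows
means = df1.groupby("grade", observed=False)["student_mark"].mean()
print(means)
```

By default pd.cut uses right-closed intervals, so a mark of exactly 40 falls into 'F' here.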

How to create a scatter plot where values are across multiple columns?

I have a dataframe in Pandas in which the rows are observations at different times and each column is a size bin where the values represent the number of particles observed for that size bin. So it looks like the following:
bin1 bin2 bin3 bin4 bin5
Time1 50 200 30 40 5
Time2 60 60 40 420 700
Time3 34 200 30 67 43
I would like to use plotly/cufflinks to create a scatterplot in which the x axis will be each size bin, and the y axis will be the values in each size bin. There will be three colors, one for each observation.
As I'm more experienced in Matlab, I tried indexing the values using iloc (note the example below is just trying to plot one observation):
df.iplot(kind="scatter",theme="white",x=df.columns, y=df.iloc[1,:])
But I just get a key error: 0 message.
Is it possible to use indexing when choosing x and y values in Pandas?
Rather than indexing, I think you need to better understand how pandas and matplotlib interact with each other.
Let's go by steps for your case:
As the pandas.DataFrame.plot documentation says, the plotted series is a column. You have the series in the row, so you need to transpose your dataframe.
To create a scatterplot, you need both x and y coordinates in different columns, but you are missing the x column, so you also need to create a column with the x values in the transposed dataframe.
Apparently pandas does not change color by default with consecutive calls to plot (matplotlib does it), so you need to pick a color map and pass a color argument, otherwise all points will have the same color.
Here is a working example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Here I copied your data into a data.txt text file and imported it into pandas as a CSV.
# You may have a different way to get your data.
df = pd.read_csv('data.txt', sep=r'\s+', engine='python')
# I assume there is a column named 'time', which is set as the index, as you show in your post.
df = df.set_index('time')
tdf = df.transpose()  # transpose the dataframe so that each observation becomes a column
# Since 'time' is now the index, there is no 'time' row left to drop after transposing.
# Create the x values; I go from 1 to 5, but they can be different.
tdf['xval'] = np.arange(1, len(tdf) + 1)
# Choose a colormap and make a list of colors, one per observation.
datacols = [datacol for datacol in tdf.columns if datacol != 'xval']
colormap = plt.cm.rainbow
colors = [colormap(i) for i in np.linspace(0, 1, len(datacols))]
# Make an empty plot; the columns will be added to the axes in the loop.
fig, axes = plt.subplots(1, 1)
for i, cl in enumerate(datacols):
    tdf.plot(x='xval', y=cl, kind="scatter", ax=axes, color=colors[i])
plt.show()
This plots the following image:
Here is a tutorial on picking colors in matplotlib.

labeling data points with dataframe including empty cells

I have an Excel sheet like this:
A    B    C    D
3    1    2    8
4    2    2    8
5    3    2    9
          2    9
6    4    2    7
Now I am trying to plot 'B' over 'C' and label the data points with the entries of 'A'. It should show me the points 1/2, 2/2, 3/2 and 4/2 with the corresponding labels.
import matplotlib.pyplot as plt
import pandas as pd
import os
df = pd.read_excel(os.path.join(os.path.dirname(__file__), "./Datenbank/Test.xlsx"))
fig, ax = plt.subplots()
df.plot('B', 'C', kind='scatter', ax=ax)
df[['B','C','A']].apply(lambda x: ax.text(*x),axis=1);
plt.show()
Unfortunately, I get a plot together with this error:
ValueError: posx and posy should be finite values
As you can see, it did not label the last data point. I know it is because of the empty cells in the sheet, but I cannot avoid them. There is just no measurement data at these positions.
I already searched for a solution here:
Annotate data points while plotting from Pandas DataFrame
but it did not solve my problem.
So, is there a way to still label the last data point?
P.S.: The excel sheet is just an example. So keep in mind in reality there are many empty cells at different positions.
You can simply drop the rows with invalid data from df before plotting:
df = df[df['B'].notnull()]
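Put together with the labelling step, this could look roughly like the sketch below. The dataframe is typed in by hand instead of read from the Excel file, I assume the empty cells are in columns 'A' and 'B', and the Agg backend is only there so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see the window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# hand-typed stand-in for the Excel sheet; NaN marks the empty cells
df = pd.DataFrame({
    "A": [3, 4, 5, np.nan, 6],
    "B": [1, 2, 3, np.nan, 4],
    "C": [2, 2, 2, 2, 2],
    "D": [8, 8, 9, 9, 7],
})

# keep only rows where the position and the label are all present
valid = df.dropna(subset=["A", "B", "C"])

fig, ax = plt.subplots()
valid.plot("B", "C", kind="scatter", ax=ax)
for b, c, a in zip(valid["B"], valid["C"], valid["A"]):
    ax.text(b, c, f"{a:g}")  # label each point with its 'A' value
```

Using dropna with subset also covers empty cells in 'C' or 'A', not only in 'B'.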
