I have a time series covering months 1 to 420 (35 years). I would like to convert it to an annual series by averaging the 12 months of each year, so I can merge it into a dataframe I have with annual datapoints. I have it set up using a range with steps of 12, but it gets kind of messy. Ideally I would like to use the resample function, but I'm having trouble since there are no dates. Is there any way around this?
There's no need to resample in this case. Just use groupby with integer division to obtain the average over the years.
import numpy as np
import pandas as pd
# Sample data: 420 monthly values
np.random.seed(123)
df = pd.DataFrame({'Months': np.arange(1, 421, 1),
                   'val': np.random.randint(1, 10, 420)})
# Yearly average: months 1-12 -> year 0, 13-24 -> year 1, ...
# Subtract 1 before // so the grouping lines up with 12-month blocks.
df.groupby((df.Months - 1) // 12).val.mean().reset_index().rename(columns={'Months': 'Year'})
Outputs:
Year val
0 0 3.083333
1 1 4.166667
2 2 5.250000
3 3 4.416667
4 4 5.500000
5 5 4.583333
...
31 31 5.333333
32 32 5.000000
33 33 6.250000
34 34 5.250000
Feel free to add 1 to the Year column or whatever else you need to make it consistent with the indexing in your other annual df. Alternatively, you could use df.groupby((df.Months+11)//12).val.mean() to get Year to start at 1.
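If you would still rather go through resample, one option is to attach a synthetic monthly DatetimeIndex first. This is only a sketch: the 2000-01-31 start date is arbitrary and exists purely to give resample dates to work with (on pandas >= 2.2, use the 'ME' and 'YE' frequency aliases instead of 'M' and 'A'):
s = pd.Series(df['val'].values,
              index=pd.date_range('2000-01-31', periods=len(df), freq='M'))
annual = s.resample('A').mean()            # one mean per synthetic year, 35 values
annual = annual.reset_index(drop=True)     # drop the fake years, back to index 0..34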
I have a pandas DataFrame with date as the index and a column, 'spendings'. I intend to get the rolling max() of the 'spendings' column for the trailing 1 calendar month (not 30 days or 4 weeks).
I tried to capture the problem in a snippet with custom data, below (borrowed from Pandas monthly rolling operation):
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings
20210325 15
20210405 20
20210415 10
20210425 40
20210505 3
20210515 2
20210525 2
20210527 1
"""
)
df = pd.read_csv(data, sep=r"\s+", parse_dates=True)
df.index = pd.to_datetime(df.date, format='%Y%m%d')
del df['date']
Now, to create a column 'max' holding the rolling max() of 'spendings' over the last 1 calendar month, I use:
df['max'] = df.loc[(df.index - pd.tseries.offsets.DateOffset(months=1)):df.index, 'spendings'].max()
This raises an exception like:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [DatetimeIndex(['2021-02-25', '2021-03-05', '2021-03-15', '2021-03-25',
'2021-04-05', '2021-04-15', '2021-04-25'],
dtype='datetime64[ns]', name='date', freq=None)] of type DatetimeIndex
However, if I manually access a random month window like below, it works without exception:
>>> df['2021-04-16':'2021-05-15']
spendings
date
2021-04-25 40
2021-05-05 3
2021-05-15 2
(I could have followed the list-comprehension method here: https://stackoverflow.com/a/47199274/235415, but I would like to use pandas' vectorized methods. I have many DataFrames and each is very large, so a list comprehension is very slow here.)
Q: How to get the vectorized method of performing rolling 1 calendar month's max()?
The expected output, i.e. primarily the 'max' column (holding the max value of 'spendings' over the last 1 calendar month), will be something like this:
>>> df
spendings max
date
2021-03-25 15 15
2021-04-05 20 20
2021-04-15 10 20
2021-04-25 40 40
2021-05-05 3 40
2021-05-15 2 40
2021-05-25 2 40
2021-05-27 1 3
The answer is the list comprehension after all (one .loc slice per index value, so not fully vectorized):
[df.loc[x - pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max() for x in df.index]
Out[53]: [15, 20, 20, 40, 40, 40, 40, 3]
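A possible loop-free alternative (a sketch only, assuming pandas >= 1.1, where pandas.api.indexers.VariableOffsetWindowIndexer is available; its window-closure rules may differ slightly from the inclusive .loc slice above, so verify it against the expected output):
import pandas as pd
from pandas.api.indexers import VariableOffsetWindowIndexer
# Variable-width window reaching back one calendar month from each row's own date.
indexer = VariableOffsetWindowIndexer(index=df.index, offset=pd.DateOffset(months=1))
df['max'] = df['spendings'].rolling(indexer, min_periods=1).max()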
I have a problem with the "groupby" method, which works just fine in pandas 1.1.5 but has problems in pandas 1.3.2 (or I am doing something wrong). Please take a look at the code:
#! /usr/bin/env python3
import pandas as pd
import numpy as np
wallet = pd.DataFrame(columns=['Data', 'Ticker', 'Cost', 'Qty'])
if __name__ == "__main__":
    to_add = ['17022019', 'pcr', 10, 10]
    a_series = pd.Series(to_add, index=wallet.columns)
    wallet = wallet.append(a_series, ignore_index=True)
    to_add = ['12042020', 'pcr', 12, 15]
    a_series = pd.Series(to_add, index=wallet.columns)
    wallet = wallet.append(a_series, ignore_index=True)
    to_add = ['19012021', 'peo', 8, 16]
    a_series = pd.Series(to_add, index=wallet.columns)
    wallet = wallet.append(a_series, ignore_index=True)
    #wallet.to_csv('tmp.csv', sep = ';', encoding = "ISO-8859-2", index=False)
    #wallet = pd.read_csv('tmp.csv', sep = ';', encoding = "ISO-8859-2")
    #os.remove('tmp.csv')
    print(wallet)
    summary = wallet.groupby(['Ticker']).sum()
    summary.reset_index()
    print(summary)
The three lines before print(wallet) are commented out; let's keep them that way for a moment. The result of this code on Python 3.6.6 and pandas 1.1.5 looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Data Cost Qty
Ticker
pcr 1702201912042020 22 25
peo 19012021 8 16
So, pretty much OK. If I uncomment those three lines, the result looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Data Cost Qty
Ticker
pcr 29064039 22 25
peo 19012021 8 16
The Data column in summary looks different, but that is OK. The real problem is on Python 3.9.6 and pandas 1.3.2. With the three lines commented out, the output looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Empty DataFrame
Columns: []
Index: [pcr, peo]
So summary is an empty dataframe with only an index (why? Is it my mistake or a bug in pandas, and why is there a difference between pandas 1.1.5 and 1.3.2?)
But when I uncomment those three lines, the output looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Data Cost Qty
Ticker
pcr 29064039 22 25
peo 19012021 8 16
So again, it works as it should.
My questions are:
Why does the output look different between pandas 1.1.5 and 1.3.2?
Why does it work normally if I write the dataframe to a CSV on the hard drive and read it back before the groupby, while without that round-trip I get an empty dataframe?
Is there any solution other than writing it to disk and reading it again? Maybe I can write it to some buffer in memory?
Am I doing something wrong with this groupby in pandas 1.3.2? Has this mechanism changed in some way I don't know about?
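One likely explanation, offered here as a sketch rather than a confirmed diagnosis: wallet is built with all object-dtype columns, and in pandas 1.3 groupby().sum() silently drops non-numeric ("nuisance") columns, which in this case is every column, leaving only the index. The CSV round-trip "fixes" it only because read_csv re-infers numeric dtypes. Converting the numeric columns in memory should have the same effect without touching the disk:
# Assumption: Cost and Qty are the columns meant to be summed; Data stays text and
# may still be dropped or concatenated by sum(), depending on the pandas version.
wallet[['Cost', 'Qty']] = wallet[['Cost', 'Qty']].apply(pd.to_numeric)
summary = wallet.groupby(['Ticker']).sum()
print(summary)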
Let's say I'm measuring the speed over time of a car moving forward on a single axis, with a new measure every 10 minutes.
I have a column in my DataFrame called delta_x, which contains how much the car moved on my axis in the last 10 minutes, values are integers only.
Now let's say that I want to aggregate my data and keep only the total movement over each hour. I want to optimize my code as much as possible, because my dataset is extremely large. What is the most efficient way to achieve that?
df.head(9)
date time delta_x
0 01/01/2018 00:00 9
1 01/01/2018 00:10 9
2 01/01/2018 00:20 9
3 01/01/2018 00:30 9
4 01/01/2018 00:40 11
5 01/01/2018 00:50 12
6 01/01/2018 01:00 10
7 01/01/2018 01:10 10
8 01/01/2018 01:20 10
Currently my solution is to do the following:
for file in os.listdir('temp'):
    if file.endswith('.txt'):
        df = pd.read_csv(''.join(["./temp/", file]), header=None, delim_whitespace=True)
        df.columns = ['date', 'time', 'delta_x']
        df['hour'] = [datetime.strptime(x, "%H:%M").hour for x in df['time'].values]
        df = df.groupby(['date', 'hour']).agg({'delta_x': 'sum'})
which outputs the correct result:
date hour delta_x
01/01/2018 0 59
But I was wondering: is there a better, faster, more efficient way, perhaps using NumPy?
You can try the following packages, which are designed to speed up pandas operations:
https://github.com/jmcarpenter2/swifter
https://github.com/modin-project/modin
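Before reaching for those packages, it may be worth replacing the per-row strptime call, which is usually the slow part. A loop-free sketch (the file name example.txt is made up; the column layout follows the question): the hour is just the first two characters of the 'HH:MM' string.
import pandas as pd
df = pd.read_csv('./temp/example.txt', header=None, delim_whitespace=True,
                 names=['date', 'time', 'delta_x'])
df['hour'] = df['time'].str.slice(0, 2).astype('int64')   # vectorized, no datetime parsing
hourly = df.groupby(['date', 'hour'], as_index=False)['delta_x'].sum()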
This question already has answers here:
How to groupby consecutive values in pandas DataFrame
(4 answers)
Closed 3 years ago.
So I have a DataFrame with two columns, one with label names (df['Labels']) and the other with int values (df['Volume']).
df = pd.DataFrame({'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
                   'Volume': [10,40,20,20,50,60,40,50,50,60,10,10,10,10,20,20,10,20,80,90,90,80,100]})
I would like to identify the intervals where the label changes and calculate the median of the 'Volume' column for each of these intervals. Then I want to replace every value in 'Volume' with the respective median of its interval.
In the case of label A, I would like to get a separate median for each of its two intervals.
Here is what my DataFrame should look like:
df2 = pd.DataFrame({'Labels':['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
'Volume':[20,20,20,20,50,50,50,50,50,50,10,10,10,10,10,10,10,10,90,90,90,90,90]})
You want to groupby the blocks and transform median:
blocks = df['Labels'].ne(df['Labels'].shift()).cumsum()
df['group_median'] = df['Volume'].groupby(blocks).transform('median')
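For the sample data, blocks comes out as 1 for the first run of A, 2 for the B run, 3 for the second run of A and 4 for the C run, so the two A intervals get separate medians (20 and 10) instead of one combined value.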
Use Series.ne with Series.shift and cumsum to create the groups for groupby, then use transform:
df['Volume']=df.groupby(df['Labels'].ne(df['Labels'].shift()).cumsum())['Volume'].transform('median')
print(df)
Labels Volume
0 A 20
1 A 20
2 A 20
3 A 20
4 B 50
5 B 50
6 B 50
7 B 50
8 B 50
9 B 50
10 A 10
11 A 10
12 A 10
13 A 10
14 A 10
15 A 10
16 A 10
17 A 10
18 C 90
19 C 90
20 C 90
21 C 90
22 C 90
I am trying to plot columns using pandas, running in an IPython environment with Python 3.4.3. Using the read_excel function, I try to convert an xlsx file to a DataFrame as follows:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_excel('/Path/to/file.xlsx', sheetname='Sheet1')
print(data)
which results in
{'Sheet1': Day a b c d
0 Monday 24 1 34.0 3
1 Tuesday 4 7 8.0 2
2 Wednesday 3 6 3.0 1
3 Thursday 2 6 4.0 0
4 Friday 1 34 -11.5 -1
5 Saturday 0 2 -21.0 -2
6 Sunday -1 4 -30.5 -3}
I know this format is incorrect as it doesn't match the formatting when a test excel file is made from scratch; the columns are not properly aligned. This also prevents me from even printing the columns using:
print(data.columns)
which returns
AttributeError: 'dict' object has no attribute 'columns'
Is there a simple way to reformat the data so columns can be referenced and graphed?
I think data is a dictionary of dataframes, with one entry per sheet of your excel file; you should be able to access the individual dataframes with data['Sheet1'].
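A minimal sketch of that fix, assuming the dict key is 'Sheet1' as the printed output suggests (column names taken from the output shown above):
df = data['Sheet1']                       # pull the single sheet out of the dict
print(df.columns)                         # now works: Day, a, b, c, d
df.plot(x='Day', y=['a', 'b', 'c', 'd'])  # plot the numeric columns against Day
plt.show()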