Calculating cumulative average over every x successive rows in Excel (not to be confused with averaging every x rows at a gap interval)

I want to calculate a cumulative average over every 3 successive rows of the value field; the expected output is the cumulative-average column shown in the example below. I tried the OFFSET method, but it gives an average only at every 3-row gap interval, not the cumulative average over every 3 continuous rows.

Use Series.rolling with mean and then Series.shift:
import pandas as pd

N = 3
df = pd.DataFrame({'Value': [6, 9, 15, 3, 27, 33]})
# trailing mean over N rows, shifted up so each result sits on the first row of its window
df['Cum_sum'] = df['Value'].rolling(N).mean().shift(-N + 1)
print (df)
   Value  Cum_sum
0      6     10.0
1      9      9.0
2     15     15.0
3      3     21.0
4     27      NaN
5     33      NaN
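As an aside (not part of the original answer), the same forward-looking window can be written without the negative shift by reversing the series and letting index alignment put the results back in place:

# a sketch of an equivalent approach: reverse so the trailing window looks "forward",
# then rely on index alignment when assigning back into df
df['Cum_sum_alt'] = df['Value'][::-1].rolling(N).mean()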

Related

How to update rows based on the previous row of a dataframe in Python

I have time series data, given below:
date        product  price  amount
11/01/2019  A        10     20
11/02/2019  A        10     20
11/03/2019  A        25     15
11/04/2019  C        40     50
11/05/2019  C        50     60
My data is high-dimensional; I have included a simplified version with just two value columns, price and amount. I am trying to transform it into relative changes based on the time index, as illustrated below:
date        product  price  amount
11/01/2019  A        NaN    NaN
11/02/2019  A        0      0
11/03/2019  A        15     -5
11/04/2019  C        NaN    NaN
11/05/2019  C        10     10
I want the relative change of each product over time; if a previous date does not exist for a given product, the result should be NaN.
Can you please tell me whether there is a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
        date product  price  amount
0 2019-11-01       A    NaN     NaN
1 2019-11-02       A    0.0     0.0
2 2019-11-03       A   15.0    -5.0
3 2019-11-04       C    NaN     NaN
4 2019-11-05       C   10.0    10.0
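For reference, a minimal self-contained sketch of this approach (column names as in the question; the explicit frame construction and date parsing are assumptions made to reproduce the printed output):

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['11/01/2019', '11/02/2019', '11/03/2019',
                            '11/04/2019', '11/05/2019']),
    'product': ['A', 'A', 'A', 'C', 'C'],
    'price': [10, 10, 25, 40, 50],
    'amount': [20, 20, 15, 50, 60],
})
# per-product difference from the previous row; the first row of each group becomes NaN
df[['price', 'amount']] = df.groupby('product')[['price', 'amount']].diff()
print(df)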

How to compare and iterate over certain rows in a column while creating the output as a new column in a dataframe?

I want to backtest a trading strategy.
The data I have is OHLC (open, high, low, close) for a financial product, formatted into a dataframe with 300 rows (each row is one day), like so:
datetime    O  H  L  C
2020-03-24  1  2  3  4
2020-03-23  5  6  7  8
2020-03-22  9  1  2  3
2020-03-21  9  2  2  3
2020-03-20  9  3  2  3
2020-03-19  9  4  2  3
2020-03-18  9  5  2  3
What I want to do is, starting with the date closest to the current date, in this case the row for 2020-03-24:
1. take the number in column `L`
2. check whether the number in column `L` is greater than either of the values in column `L` for the previous two days.
3. Create and fill in a new column recording whether the value from step 1 is greater than a value in the iteration.
4. Repeat steps 1, 2, & 3, but take the next number in column `L` that was not included in the previous iteration.
Example:
1. Starting on row `2020-03-24`, take value `3`
2. Is `3` at any point greater than `7` or `2` for rows starting with `2020-03-23` and `2020-03-22`?
3. YES, assign `TRUE` to column `comparison` in df for row starting with `2020-03-24`
4. Repeat, starting on row `2020-03-21`, take value `2` in column `L`
4a. Is `2` at any point greater than values in rows `2020-03-20` or `2020-03-19`?
4b. NO, assign `FALSE` to column `comparison` in df for row starting with `2020-03-21`.
New df looks like this:
datetime    O  H  L  C  Comparison
2020-03-24  1  2  3  4  TRUE
2020-03-23  5  6  7  8
2020-03-22  9  1  2  3
2020-03-21  9  2  2  3  FALSE
2020-03-20  9  3  2  3
2020-03-19  9  4  2  3
2020-03-18  9  5  2  3
The only way I know how to do this is with a for loop, but that doesn't work for iterating over and comparing only certain subsets, like so:
for i in df['L']:
    if df['L'] >
You need a combination of rolling() and shift():
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True, ascending=False)   # most recent date first
df['Comparison'] = False
# rolling(2).min() takes the smaller of two consecutive L values; shift(-2) aligns each
# row with the minimum of the two rows below it (the previous two days)
df['Comparison'] = df.loc[:, 'L'] > df.loc[:, 'L'].rolling(window=2).min().shift(-2)
With rolling() you get the minimum of the last two days, shift() moves it to the right row.
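For completeness, a runnable sketch on the sample data above (the frame construction is an assumption; the answer starts from an existing df). Note this evaluates the comparison for every row, whereas the question only fills every third row; the flags for 2020-03-24 (True) and 2020-03-21 (False) agree with the expected output.

import pandas as pd

idx = pd.to_datetime(['2020-03-24', '2020-03-23', '2020-03-22', '2020-03-21',
                      '2020-03-20', '2020-03-19', '2020-03-18'])
df = pd.DataFrame({'O': [1, 5, 9, 9, 9, 9, 9],
                   'H': [2, 6, 1, 2, 3, 4, 5],
                   'L': [3, 7, 2, 2, 2, 2, 2],
                   'C': [4, 8, 3, 3, 3, 3, 3]}, index=idx)
df.sort_index(inplace=True, ascending=False)
# each row's L is compared against the minimum L of the two rows below it
df['Comparison'] = df['L'] > df['L'].rolling(window=2).min().shift(-2)
print(df)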

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected behaviour; from the DataFrame.std documentation:
Normalized by N-1 by default. This can be changed using the ddof argument.
With one element you are dividing by N-1 = 0, so if you select a single column and take the sample standard deviation across columns (axis=1), you get only missing values.
Sample:
inp_df = pd.DataFrame({'salary':[10,20,30],
'num_months':[1,2,3],
'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column by one [] for Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of Series - get a scalar:
print (inp_df['salary'].std())
10.0
Select one column by double [] for one column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get std of DataFrame per index (default value) - get one element Series:
print (inp_df[['salary']].std())
#same like
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get std of DataFrame per columns (axis=1) - get all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If the default ddof=1 is changed to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
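As a quick cross-check (not part of the original answer), the same numbers fall out of numpy's std with an explicit ddof, which is where the division by N-1 happens:

import numpy as np

np.std([10.0], ddof=0)        # 0.0 -> population std of a single value
np.std([10.0, 2.0], ddof=1)   # 5.656854..., matches the first row of the axis=1 output above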

Python 3: how to create a new pandas column as the difference of two consecutive rows of another column?

I have a pandas dataframe
x
1
3
4
7
10
I want to create a new column y as y[i] = x[i] - x[i-1] (and y[0] = x[0]).
So the above data frame will become:
x y
1 1
3 2
4 1
7 3
10 3
How can I do that with Python 3? Many thanks.
Using .shift() and fillna():
df['y'] = (df['x'] - df['x'].shift(1)).fillna(df['x'])
To explain what this is doing, if we print(df['x'].shift(1)) we get the following series:
0 NaN
1 1.0
2 3.0
3 4.0
4 7.0
These are your values from 'x' shifted down one row. The first row gets NaN because there is no value above it to shift down. So, when we do:
print(df['x'] - df['x'].shift(1))
We get:
0 NaN
1 2.0
2 1.0
3 3.0
4 3.0
These are your subtracted values, but the first row is NaN again. To clear this, we use .fillna(), telling it to take the value from df['x'] wherever a null value is encountered.
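As an aside (not part of the original answer), Series.diff() does the shift-and-subtract in one step:

# same result: diff() subtracts the previous row, fillna() supplies the first row
df['y'] = df['x'].diff().fillna(df['x'])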

Randomly select a percentage of columns and rows in a dataframe (Pandas, Python 3)

I am trying to randomly select a certain percentage of rows and columns in my dataframe and fit these features into a logistic regression over 10 iterations. My dependent variable is whether a team won (1) or lost (0).
If I have a df that looks something like this (data is made up):
Won  Field  Injuries  Weather  Fouls  Players
1    2      3         1        2      8
0    3      2         0        1      5
1    4      5         3        2      6
1    3      2         1        4      5
0    2      3         0        1      6
1    4      2         0        2      8
...
For example, let's say I want to select 50% (but this could change). I want to randomly select 50% (or the closest amount to 50% if it's an odd number) of the columns (field, injuries, weather, fouls, players) and 50% of the rows in those columns to place in my model.
Here is my current code, which currently selects all of the columns and rows and fits them into my model, but I would like to specify a random percentage:
z = []
for i in range(10):
    train_cols = df.columns[1:]
    logit = sm.Logit(df['Won'], df[train_cols])
    result = logit.fit()
    exp = np.exp(result.params)
    z.append([i, exp])
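One possible sketch (not an answer from the original thread): let DataFrame.sample pick a fraction of the rows and numpy.random.choice pick a fraction of the feature columns. It assumes statsmodels is imported as sm and numpy as np, as implied by the snippet above, and treats 0.5 as a placeholder fraction:

import numpy as np

frac = 0.5
feature_cols = df.columns[1:]                     # everything except 'Won'
n_cols = max(1, round(len(feature_cols) * frac))  # closest whole number of columns

z = []
for i in range(10):
    cols = list(np.random.choice(feature_cols, size=n_cols, replace=False))
    sample = df[['Won'] + cols].sample(frac=frac)  # random ~50% of the rows
    logit = sm.Logit(sample['Won'], sample[cols])
    result = logit.fit(disp=0)
    z.append([i, np.exp(result.params)])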
