I am analysing a stock index in 4-year cycles. I would like to start with a base of 1 at the beginning of the first year of each cycle and then calculate the returns on top of that, so that I get a column in the dataframe that goes 1, 1.02, 1.03, 1.025...
The return indexed calculation would be (todaysValue/yesterdaysValue)*yesterdaysIndexValue.
The df looks like this:
Datetimeindex Stockindex CycleYear Diff Daynumber
01.01.1968 96.47 0 1 1
...
03.01.1972 101.67 0 1 1
...
06.09.2022 3908.19 2 0 699
07.09.2022 3979.87 2 0 700
08.09.2022 4006.18 2 0 701
I would now like to add a column df['Retindex'] that starts every 4 years at 1 and calculates the indexed-returns until the end of year 4.
I have created the column so that it has a 1 at the start of each cycle.
df['Retindex'] = df['Daynumber'].loc[df['Daynumber'] == 1]
Then I tried creating the rest of the index with this:
for id in df[df['Retindex'].isnull() == True].index:
    df.loc[id, 'Retindex'] = (df['Stockindex']/df['Stockindex'].shift().loc[id]) * df['Retindex'].shift().loc[id]
Here I am getting the error: "ValueError: Incompatible indexer with Series"
I have tried other ways as well but I am unfortunately not progressing on this. Can anyone help?
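One possible vectorized approach, offered as a sketch rather than a definitive answer: chaining (todaysValue/yesterdaysValue)*yesterdaysIndexValue from a base of 1 telescopes to todaysValue divided by the first value of the cycle, so no loop is needed. The sample prices below are made up for illustration, and the code assumes that Daynumber == 1 marks the first row of every cycle, as in the question.

```python
import pandas as pd

# Illustrative data only: two short "cycles", each starting where Daynumber == 1.
df = pd.DataFrame(
    {
        "Stockindex": [100.0, 102.0, 103.0, 50.0, 51.0],
        "Daynumber": [1, 2, 3, 1, 2],
    }
)

# Each row with Daynumber == 1 opens a new cycle, so the running count of
# such rows gives every row a cycle label (1, 1, 1, 2, 2 here).
cycle = (df["Daynumber"] == 1).cumsum()

# Broadcast the first Stockindex of each cycle to all rows of that cycle.
cycle_start = df.groupby(cycle)["Stockindex"].transform("first")

# The indexed return is the price relative to the start of the cycle.
df["Retindex"] = df["Stockindex"] / cycle_start
print(df)
```

The reported ValueError most likely arises because the right-hand side of the assignment in the loop is a whole Series (df['Stockindex'] divided by a scalar) being written into a single cell.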
I have a very large dataset with over 400,000 rows and growing. I understand that you are not supposed to use iterrows to modify a pandas data frame. However, I'm a little lost on what I should do in this case, since I'm not sure I could use .loc or some rolling filter to modify a data frame in the way I need to. I'm trying to figure out if I can take a data frame and average each range of rows over which the condition is met. For example:
Condition  Temp.  Pressure
1          8      20
1          7      23
1          8      22
1          9      21
0          4      33
0          3      35
1          9      21
1          11     20
1          10     22
For the runs where the condition == 1, the output dataframe would look like this:
Condition  Avg. Temp.  Avg. Pressure
1          8           21.5
1          10          21
Has anyone attempted something similar that can put me on the right path? I was thinking of using something like this:
df = pd.read_csv(csv_file)
for index, row in df.iterrows():
    if row['Condition'] == 1:
        pass  # start index = first value that equals 1
    else:
        # end index & calculate rolling average of range
        length = end - start
        new_df = df.rolling(length).mean()
I know that my code isn't great. I also know I could brute-force it with something similar to what I have shown above, but as I said, the dataset has a lot of rows and continues to grow, so I need to be efficient.
TRY:
result = df.groupby((df.Condition != df.Condition.shift()).cumsum()).apply(
    lambda x: x.rolling(len(x)).mean().dropna()).reset_index(drop=True)
print(result.loc[result.Condition.eq(1)]) # filter by required condition
OUTPUT:
Condition Temp. Pressure
0 1.0 8.0 21.5
2 1.0 10.0 21.0
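A rolling window as long as the whole group yields exactly one non-null row, which is the group mean, so the same result can probably be reached with a plain groupby mean. The following is a minimal sketch under that assumption, using the sample data from the question; the name run_id is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Condition": [1, 1, 1, 1, 0, 0, 1, 1, 1],
        "Temp.": [8, 7, 8, 9, 4, 3, 9, 11, 10],
        "Pressure": [20, 23, 22, 21, 33, 35, 21, 20, 22],
    }
)

# A new run starts wherever Condition differs from the previous row;
# the cumulative sum of those change points labels each consecutive run.
run_id = (df["Condition"] != df["Condition"].shift()).cumsum().rename("run_id")

# Average every column within each run, then keep only the runs where
# Condition is 1 (its mean within a run is exactly 1.0 or 0.0).
means = df.groupby(run_id).mean()
result = means[means["Condition"] == 1].reset_index(drop=True)
print(result)
```

This avoids both iterrows and apply, which tends to matter on a frame with several hundred thousand rows.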
I'm working in pandas and I have a column in my dataframe filled with 0s and incrementing integers starting at one. I would like to add another column of integers, but that column would be a counter of how many intervals separated by zero we have encountered up to this point. For example, my data would look like
Index
1
2
3
0
1
2
0
1
and I would like it to look like
Index IntervalCount
1 1
2 1
3 1
0 1
1 2
2 2
0 2
1 2
Is it possible to do this with a vectorized operation, or do I have to do it iteratively? Note that it's not important that it be a new column; it could also overwrite the old one.
You can use the cumsum function.
df["IntervalCount"] = (df["Index"] == 1).cumsum()
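As a quick check of that one-liner, here is a small self-contained run on the sample column from the question. It assumes, as the question states, that every interval restarts at 1.

```python
import pandas as pd

df = pd.DataFrame({"Index": [1, 2, 3, 0, 1, 2, 0, 1]})

# (df["Index"] == 1) is True exactly where an interval begins, and the
# cumulative sum of booleans counts how many intervals have begun so far.
df["IntervalCount"] = (df["Index"] == 1).cumsum()
print(df["IntervalCount"].tolist())  # [1, 1, 1, 1, 2, 2, 2, 3]
```

Note that the final row comes out as 3 rather than the 2 shown in the question's table, because the trailing 1 opens a third interval.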
I have been struggling with a problem with my data frame, built in pandas, which currently looks like this
MyDataFrame:
Index Status Value
0 A 10
1 A 8
2 A 5
3 B 9
4 B 5
5 A 1
6 B 2
7 A 3
8 A 5
9 A 1
The desired output would be:
Index Status Value
0 A 10
1 B 9
2 A 1
3 B 2
4 A 5
So far I have tried to use range and while conditions to filter. However, if I put in a conditional like:
for i in range(len(Status)):
    if Status[i] == "A":
        print(Value[i])
    if Status[i] == "B":
        break
(The code above is more an example of what I have been trying in order to reach my goal. I tried to use .iloc and range with while, but maybe in the wrong way, I don't know.)
The desired output isn't printed.
One thing that complicates this filtering process is that MyDataFrame changes every time that I run the script since it uses another base of data to create this DataFrame.
I believe that I'm missing something simple, but it has been almost a week and I can't figure it out.
Thanks in advance for all your answers and support.
Let us try using shift with cumsum to create the groupby key; then it is groupby + agg:
out = df.groupby(df.Status.ne(df.Status.shift()).cumsum()).agg({'Status':'first','Value':'max'})
Out[14]:
Status Value
Status
1 A 10
2 B 9
3 A 1
4 B 2
5 A 5
Very close to @BEN_YO:
grp = (df['Status'] != df['Status'].shift()).cumsum()
df.loc[df.groupby(grp)['Value'].idxmax()]
Output:
Status Value
Index
0 A 10
3 B 9
5 A 1
6 B 2
8 A 5
Create groups using shift and inequality with cumsum, then groupby and find the index of the max value of 'Value' with idxmax, and filter the dataframe using loc.
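To make the grouping key concrete, here is a self-contained sketch that rebuilds the sample frame from the question and runs both variants side by side. The name run_key is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Status": list("AAABBABAAA"),
        "Value": [10, 8, 5, 9, 5, 1, 2, 3, 5, 1],
    }
)

# True wherever Status changes from the previous row; the cumulative sum
# turns those change points into one label per consecutive run.
run_key = df["Status"].ne(df["Status"].shift()).cumsum().rename("run_key")
print(run_key.tolist())  # [1, 1, 1, 2, 2, 3, 4, 5, 5, 5]

# Variant 1: aggregate each run to its Status and its maximum Value.
agg_out = (
    df.groupby(run_key)
    .agg(Status=("Status", "first"), Value=("Value", "max"))
    .reset_index(drop=True)
)

# Variant 2: keep the original rows (and original index) holding each maximum.
idx_out = df.loc[df.groupby(run_key)["Value"].idxmax()]

print(agg_out)
print(idx_out)
```

The first variant renumbers the rows 0 to 4 as in the desired output, while the second preserves the original row labels, which can be handy if other columns need to come along.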
I am trying to look up a value in a matrix based on a given date. The matrix has the first day of the week along the vertical axis, and the first day of the month along the horizontal axis.
For a given day, e.g. 31/08/15 I would like to match the exact date to the vertical axis of the matrix (i.e. 31/08/15), and the month to the horizontal axis (1/08/15).
So in the example below, an input of 31/08/15 should provide an output of 3.
01/06/2015 01/07/2015 01/08/2015 01/09/2015
03/08/2015 1 0 0 0
10/08/2015 0 2 0 0
17/08/2015 0 0 3 0
24/08/2015 0 0 0 4
31/08/2015 0 0 3 0
I am trying and failing with index and match formulae.
I have tried the following:
=index(area where to look, match(31/08/15,first column,0),match(and(month(31/08/15),year(31/08/15)),(and(month(first row),year(first row)),0)
Hope this is clear, thanks!
You can use an INDEX function with two MATCH functions to supply both the row and column.
The formula in D8 is,
=INDEX($B$2:$E$6,MATCH(C8,$A$2:$A$6,0),MATCH(DATE(YEAR(C8),MONTH(C8),1),$B$1:$E$1,0))
I'm a little concerned about the dates matching exactly down column A but a little maths manipulation with the WEEKDAY function would take care of that.
=INDEX($B$2:$E$6,MATCH(C9-WEEKDAY(C9, 2)+1,$A$2:$A$6,0),MATCH(DATE(YEAR(C9),MONTH(C9),1),$B$1:$E$1,0))
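For readers who want to check the date arithmetic outside Excel, the following Python sketch mirrors the second formula. C9-WEEKDAY(C9, 2)+1 snaps a date back to the Monday of its week, and DATE(YEAR(C9),MONTH(C9),1) snaps it to the first of its month. The matrix is the sample from the question; the function names week_start, month_start and lookup are hypothetical, and the sketch assumes the row labels are all Mondays, as they are in the sample.

```python
from datetime import date, timedelta

cols = [date(2015, 6, 1), date(2015, 7, 1), date(2015, 8, 1), date(2015, 9, 1)]
rows = {
    date(2015, 8, 3): [1, 0, 0, 0],
    date(2015, 8, 10): [0, 2, 0, 0],
    date(2015, 8, 17): [0, 0, 3, 0],
    date(2015, 8, 24): [0, 0, 0, 4],
    date(2015, 8, 31): [0, 0, 3, 0],
}

def week_start(d):
    # date.weekday() is 0 for Monday, so this plays the role of
    # C9 - WEEKDAY(C9, 2) + 1 in the formula.
    return d - timedelta(days=d.weekday())

def month_start(d):
    # Equivalent of DATE(YEAR(C9), MONTH(C9), 1).
    return d.replace(day=1)

def lookup(d):
    # Row match on the week's Monday, column match on the month's first day.
    return rows[week_start(d)][cols.index(month_start(d))]

print(lookup(date(2015, 8, 31)))  # 3, as expected in the question
```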
Here you go:
=INDEX($B$2:$E$6,MATCH(DATE(2015,8,31),$A$2:$A$6,),MATCH(DATE(2015,8,1),$B$1:$E$1,))
I have a slight issue counting the longest run of rows where the third column is bigger than the second. This is just a statistic with scores.
The issue is that I want to have it in one single formula without a macro.
B C
------
2 0
1 2
2 1
2 3
0 1
1 2
0 1
3 3
0 2
0 2
I have tried it with:
{=MAX(FREQUENCY(B3:B100;B3:B100>=C3:C100))} to get 1 for B
{=MAX(FREQUENCY(C3:C100;C3:C100>=B3:B100))} to get 7 for C
I expected it to give me the longest series where the value in one column was bigger than in the other one, but I failed hard...
Try this version to get 7
=MAX(FREQUENCY(IF(C3:C100>=B3:B100,IF(B3:B100<>"",ROW(B3:B100))),IF(C3:C100<B3:B100,ROW(B3:B100))))
Confirm it with CTRL+SHIFT+ENTER.
Reverse the ranges to get your other result.
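As a cross-check of the expected results outside Excel, the longest-run logic can be sketched in Python with the sample scores from the question. The helper name longest_streak is hypothetical, and the comparison uses >= to match the formulas above.

```python
from itertools import groupby

b = [2, 1, 2, 2, 0, 1, 0, 3, 0, 0]
c = [0, 2, 1, 3, 1, 2, 1, 3, 2, 2]

def longest_streak(flags):
    # groupby splits the sequence into runs of equal values; keep only the
    # runs of True and return the length of the longest one (0 if none).
    return max(
        (sum(1 for _ in run) for flag, run in groupby(flags) if flag),
        default=0,
    )

c_streak = longest_streak(ci >= bi for bi, ci in zip(b, c))
b_streak = longest_streak(bi >= ci for bi, ci in zip(b, c))
print(c_streak, b_streak)  # 7 1
```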