Vectorized version of df.apply(lambda x: x.value_counts()) - python-3.x

I've got a dataframe with a somewhat large amount of time series balances in it. It looks something like
      Run1   Run2   Run3   ...  Run10000
2018  100    100    100    ...  100
2019  101.2  99.2   101.0  ...  101.6
...
2038  142.2  151.3  102.7  ...  173.0
Essentially, I want to check how many trial runs dipped below a certain number, for example 90% of the starting balance.
Currently I am doing
((portfolio_values < starting_value*0.9).apply(lambda x: x.value_counts()).loc[True] > 0).value_counts().loc[True]
Sorry that one liner is pretty atrocious, but the idea is that it creates a mask based on whether a value in the table is below 90% of the starting value, then it goes through and does a count of True and False values. It then checks which of those columns has some non-zero number of True values (meaning yes, it did dip below 90%), then it counts up how many of those values are true.
The problem is that this is really slow, and I'm sure Pandas has some kind of function that does exactly what I'm looking for, as it normally does.
Thanks in advance!

Can you use:
(portfolio_values < starting_value*0.9).any().sum()
any returns True for each column where the condition is met at least once, and sum then counts those columns, or "runs" in your case.
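A minimal sketch of that approach on made-up data (the four runs, their balances, and the threshold below are illustrative assumptions, not values from the question):

```python
import pandas as pd

# Hypothetical balances: rows are years, columns are simulated runs.
portfolio_values = pd.DataFrame(
    {
        "Run1": [100.0, 101.2, 142.2],
        "Run2": [100.0, 89.5, 151.3],   # dips below 90 in 2019
        "Run3": [100.0, 101.0, 85.0],   # dips below 90 in 2038
        "Run4": [100.0, 95.0, 173.0],
    },
    index=[2018, 2019, 2038],
)
starting_value = 100.0

# Boolean mask: True wherever a balance is below 90% of the start.
below = portfolio_values < starting_value * 0.9

# any() collapses each column to one boolean, sum() counts the True columns.
runs_that_dipped = int(below.any().sum())
print(runs_that_dipped)
```

Both steps operate on whole columns at once, so no Python-level lambda is called per column.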

Try this:
mask_df = df < starting_value*0.9
result = mask_df.any()
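I tested it in a console on a dummy example and it appears to work. result holds one boolean per column, so result.sum() gives the number of runs that dipped below the threshold.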

Related

How can I replace a particular column in a data frame based on a condition (categorical variables)?

I need to replace the salary status with 1 or 0, depending on whether the salary is greater than 50,000 or less than or equal to 50,000, in a df.
The DataFrame shape is 30162 x 13.
I have tried this:
data2['SalStat']=data2['SalStat'].map({"less than or equal to 50,000":0,"greater than 50,000":1})
I also tried data2['SalStat']
and loc without any success.
How can I do the same?
I think your solution is nice.
If you want to match only by substring, e.g. by "greater", use Series.str.contains to build a boolean mask and convert it to 0/1:
data2['SalStat']=data2['SalStat'].str.contains('greater').astype(int)
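Or, reinterpreting the booleans as int8 (Series.view is deprecated in newer pandas releases, so prefer astype there):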
data2['SalStat']=data2['SalStat'].str.contains('greater').view('i1')
Try this
def status(d): return 0 if d == 'less than or equal to 50,000' else 1
data2['SalStat'] = list(map(status ,data2['SalStat']))
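A small sketch comparing the approaches above. The sample strings, including the leading spaces, are an assumption made for illustration: one common reason map appears to fail is that the stored labels do not match the dictionary keys exactly, in which case map returns NaN.

```python
import pandas as pd

# Hypothetical sample; the leading spaces are an assumed cause of the mismatch.
data2 = pd.DataFrame(
    {
        "SalStat": [
            " less than or equal to 50,000",
            " greater than 50,000",
            "greater than 50,000",
        ]
    }
)
mapping = {"less than or equal to 50,000": 0, "greater than 50,000": 1}

# Exact-match mapping: labels that differ by whitespace become NaN.
exact = data2["SalStat"].map(mapping)

# Stripping whitespace first lets every label match a key.
cleaned = data2["SalStat"].str.strip().map(mapping)

# Substring matching is insensitive to surrounding whitespace.
by_substring = data2["SalStat"].str.contains("greater").astype(int)

print(exact.isna().sum(), cleaned.tolist(), by_substring.tolist())
```

Checking data2['SalStat'].unique() before mapping shows the exact labels that need to be matched.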

Excel - averaging an n amount of rows based on condition in prior column

I have this table in excel:
Date value
1/2/1970 100.00
1/5/1970 99.99
1/6/1970 100.37
1/7/1970 100.74
1/8/1970 101.26
1/9/1970 100.74
1/12/1970 100.79
1/13/1970 101.27
1/14/1970 101.95
1/15/1970 101.97
1/16/1970 101.76
1/19/1970 102.21
1/20/1970 102.70
1/21/1970 102.00
1/22/1970 101.46
1/23/1970 101.49
1/26/1970 100.97
1/27/1970 101.45
1/28/1970 101.70
1/29/1970 102.08
1/30/1970 102.19
2/2/1970 102.02
2/3/1970 101.85
These are values that I have daily, and I need to construct a sheet that takes a monthly index of the daily values, example below:
date index
1/31/1970 some_index
2/28/1970 some_index
3/31/1970 some_index
4/30/1970 some_index
I could only get this far when it came to getting the index of 30 days:
=AVERAGE(INDEX(B:B,1+30*(ROW()-ROW($C$1))):INDEX(B:B,30*(ROW()-ROW($C$1)+1)))
I'm just not sure how to structure this in the most efficient, yet correct, way possible. Not all months have the same number of days, so I was hoping to collect the next n rows where the date starts with a "1", for example. Sometimes certain days are also missing. I can't think of a catch-all approach.
With 1/31/1970 in C1 try this,
=averageifs(daily!b:b, daily!a:a, "<="&c1, daily!a:a, ">="&eomonth(c1, -1)+1)
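The date window that AVERAGEIFS formula uses (first day of the month through the month-end date in C1) can be sketched in Python. This is only an illustration of the logic; the function name monthly_average is hypothetical and the rows are a few values taken from the table in the question.

```python
from datetime import date
from statistics import mean

# A few (date, value) rows from the daily table in the question.
daily = [
    (date(1970, 1, 29), 102.08),
    (date(1970, 1, 30), 102.19),
    (date(1970, 2, 2), 102.02),
    (date(1970, 2, 3), 101.85),
]

def monthly_average(rows, month_end):
    """Average values dated from the 1st of the month up to month_end."""
    # Equivalent of EOMONTH(c1, -1) + 1: the first day of month_end's month.
    month_start = month_end.replace(day=1)
    selected = [v for d, v in rows if month_start <= d <= month_end]
    return mean(selected)

jan = monthly_average(daily, date(1970, 1, 31))
feb = monthly_average(daily, date(1970, 2, 28))
print(round(jan, 3), round(feb, 3))
```

Because the window is defined by dates rather than by a fixed row count, missing days and months of different lengths need no special handling.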
A PivotTable might be more convenient.

multiple condition Median If formula

I'm trying to calculate the median, since a pivot table won't work.
I have a number of conditions that I need to fulfill, so I need a
{=MEDIAN(IF(AND(A:A=A2,B:B=B2,C:C=C2,D:D=D2),T:T,""))}
type formula.
Columns A, B, C and D have the criteria and T has the value that I need the median of.
I have been able to produce a median with just 1 variable, but I'm only getting #N/A when I try more.
I have seen that an AND function doesn't work with an array, so is there another way that I can calculate the median based upon 4 different conditions?
Any Help would be greatly appreciated!
Ed
Array formulas do not work with AND or OR, so use * and + instead. In arithmetic, the TRUE and FALSE results of each Boolean test are treated as 1 and 0.
With *, if any test is FALSE the whole product becomes 0, whereas with +, if any test is TRUE the sum is greater than 0 and the IF returns its TRUE result:
=median(if((A:A=A2)*(B:B=B2)*(C:C=C2)*(D:D=D2),T:T))
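The multiply-the-tests idea can be sketched in Python. The rows and the helper name conditional_median are hypothetical; each tuple stands for columns A, B, C, D and T of one spreadsheet row.

```python
from statistics import median

# Hypothetical rows: (A, B, C, D, T).
rows = [
    ("x", "p", 1, "u", 10.0),
    ("x", "p", 1, "u", 30.0),
    ("x", "p", 1, "u", 20.0),
    ("x", "q", 1, "u", 99.0),  # fails the B test
    ("y", "p", 1, "u", 50.0),  # fails the A test
]
criteria = ("x", "p", 1, "u")  # plays the role of A2, B2, C2, D2

def conditional_median(rows, criteria):
    selected = []
    for *keys, value in rows:
        # Multiply 0/1 flags, like (A:A=A2)*(B:B=B2)*(C:C=C2)*(D:D=D2):
        # a single failed test makes the whole product 0.
        flag = 1
        for key, wanted in zip(keys, criteria):
            flag *= int(key == wanted)
        if flag:
            selected.append(value)
    return median(selected)

result = conditional_median(rows, criteria)
print(result)
```

Only the three rows that pass all four tests contribute to the median.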
If you are using Google Sheet (If not, you should :) )
The above can be achieved using a combination of the MEDIAN and FILTER functions.
FILTER(range, condition1, [condition2, ...])
=MEDIAN(FILTER(T:T, A:A=A2, B:B=B2, C:C=C2, D:D=D2))
It filters T:T based on the conditions that follow, then returns the median of the result.

Python Pandas: Average column if

In MS Excel there is a handy formula =AVERAGEIF(values, criteria).
Is there a similar way to average the values within one column that satisfy a certain condition?
I have a column of values in my data frame from -5000 to +5000.
I need to average values between -5000 <= x < 0
And separately average values between 0 < x <= 5000.
NOTE: I'd like to avoid applying Boolean mask and therefore creating new dataframe, because I have lots of columns.
Any help, suggestions, or edits to this post are welcome.
Using Boolean mask actually does what I need.
df[df>0].mean(axis=0,skipna=True,numeric_only=True)
It returns as many single values as I have columns. Perfect!
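A minimal sketch of the mask-then-mean approach on made-up data (the column names and numbers are assumptions for illustration). Indexing a DataFrame with a boolean DataFrame keeps the shape and puts NaN where the condition fails, and mean skips NaN by default.

```python
import pandas as pd

# Hypothetical frame with positive, negative, and zero values.
df = pd.DataFrame(
    {
        "a": [-4.0, 2.0, 6.0, 0.0],
        "b": [-10.0, -20.0, 5.0, 15.0],
    }
)

# Non-matching cells become NaN and are ignored by mean().
positive_mean = df[df > 0].mean()
negative_mean = df[df < 0].mean()

print(positive_mean)
print(negative_mean)
```

Each result is a Series with one average per column, which covers the "lots of columns" case in a single call.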

Formula to compare time values

The Excel formula below works fine, but in some cases it does not give me the proper value.
Input:
19:20:42
24:58:36
26:11:18
After using this formula:
=IF(TIMEVALUE(K7)>TIMEVALUE("09:00:00"),TRUE,FALSE)
I got the below output:
FALSE
TRUE
TRUE
What I observe is that if the time value is greater than or equal to 24:00:00, it does not give me the proper answer.
How do I fix this?
As an alternative to Captain's excellent answer, you could also use:
=IF(K7>(9/24),TRUE,FALSE)
DateTime values are internally stored as a number of days since 1/1/1900,
therefore 1 = 1 day = 24 hours. So 9/24 = 0.375 = 9 hours :-)
You can easily verify this by clearing the format of your DateTime cells.
Edit: note that such Boolean formula can be expressed in a shorter way without losing legibility:
=K7>(9/24)
When you go over 24 hours, Excel counts it as the next day... and then the TIMEVALUE is the time the next day (i.e. 00:58:36 and 02:11:18 in your examples) and can, therefore, be before 0900.
You could do DATEVALUE(K7)+TIMEVALUE(K7) to ensure that you count the day part too...
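The day-fraction arithmetic described in the answers above can be sketched in Python. The helper to_serial is hypothetical and only mimics how a duration maps to a fraction of a day; it is not an Excel API.

```python
def to_serial(text):
    """Convert 'HH:MM:SS' (hours may exceed 23) to a fraction of days."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return (hours * 3600 + minutes * 60 + seconds) / 86400

nine_am = 9 / 24  # 0.375 of a day
inputs = ["19:20:42", "24:58:36", "26:11:18"]

# Comparing the full serial value keeps the whole-day part.
full_comparison = [to_serial(t) > nine_am for t in inputs]

# Dropping the whole-day part mimics what TIMEVALUE does past 24 hours.
time_of_day_only = [(to_serial(t) % 1) > nine_am for t in inputs]

print(full_comparison)
print(time_of_day_only)
```

With the whole-day part kept, every input compares as later than 09:00; with only the time-of-day part, the two values past 24 hours wrap around to early morning and compare as earlier.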
