Vectorized way to find if 1 value in row from list of columns is greater than threshold

I am new to using np.where(), so of course I tried to use my new toy for this problem. But it does not work.
I have a dataframe.
Close    M    N    O    P
  0.1  0.2  0.3  0.4  0.5
  0.2  0.1  0.6  0.1  0.0
Colslist = ['M', 'N', 'O', 'P']
I want a new column called Q with the result of this formula. The formula I thought up is:
df["Q"] = np.where(df[Colslist] >= (Close + 0.3), 1,0)
The expected output is a 1 in both rows: row 0 has two values that meet the condition and row 1 has one.
I believe the problem with what I wrote is that it requires all of the values to meet the condition before it outputs a 1.
What I need is a 1 in the Q column if at least one value in that row, across the listed columns, is greater than or equal to Close of the same row + 0.3.
What's the best vectorized way to do this?

The problem is that the axes don't match in your condition. The output of
df[Colslist] >= (df['Close'] + 0.3)
is
       M      N      O      P      0      1
0  False  False  False  False  False  False
1  False  False  False  False  False  False
which doesn't make sense: the Series index (0, 1) gets aligned against the DataFrame columns (M, N, O, P), so every value is compared against NaN and comes back False.
You could use the ge method with axis=0 to make sure that you're comparing the values in Colslist against the "Close" column row by row. The result of:
df[Colslist].ge(df['Close'] + 0.3, axis=0)
is
       M      N      O      P
0  False  False   True   True
1  False   True  False  False
Now, since your condition is satisfied if there is at least one True in a row, you can apply any along axis=1. The final code:
Colslist = ['M','N','O','P']
df["Q"] = np.where(df[Colslist].ge(df['Close'] + 0.3, axis=0).any(axis=1), 1,0)
Output:
   Close    M    N    O    P  Q
0    0.1  0.2  0.3  0.4  0.5  1
1    0.2  0.1  0.6  0.1  0.0  1

Here's a one-liner with pure Pandas so you don't need to convert between NumPy and Pandas (might help with speed if your dataframe is really long):
df["Q"] = (df.max(axis=1) >= df["Close"] + 0.3).astype(int)
It finds the max value in each row and checks whether it reaches the required value; if the max isn't large enough, no value in the row will be. This takes advantage of the fact that you don't actually need to count the elements in a row that are greater than or equal to df["Close"] + 0.3; you only need to know whether at least one element meets the condition.
Then it converts the True and False answers to 1 and 0 using astype(int).
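For reference, here's a minimal self-contained sketch of that approach using the two sample rows above; it restricts the max to Colslist, although including Close would change nothing here since Close is always below Close + 0.3:
import pandas as pd

df = pd.DataFrame({
    "Close": [0.1, 0.2],
    "M": [0.2, 0.1],
    "N": [0.3, 0.6],
    "O": [0.4, 0.1],
    "P": [0.5, 0.0],
})
Colslist = ["M", "N", "O", "P"]

# Flag the row if its largest listed value clears the per-row threshold
df["Q"] = (df[Colslist].max(axis=1) >= df["Close"] + 0.3).astype(int)
print(df["Q"].tolist())  # [1, 1]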

Related

Updating Pandas data frame cells by condition

I have a data frame and want to update specific cells in a column based on a condition on another column.
   ID Name Metric     Unit  Value
2   1   K2     M1  msecond      1
3   1   K2     M2      NaN     10
4   2   K2     M1  usecond    500
5   2   K2     M2      NaN      8
The condition is: if the Unit string is msecond, multiply the corresponding value in the Value column by 1000 and store it in the same place. Iterating over the rows with a constant step of two, I tried the following code, which is not correct:
i = 0
while i < len(df_group):
    x = df.iloc[i].at["Unit"]
    if x == 'msecond':
        df.iloc[i].at["Value"] = df.iloc[i].at["Value"] * 1000
    i += 2
However, the output is the same as before the modifications. How can I fix that? Also, what are better alternatives to that while loop?
A much simpler (and more efficient) form would be to use loc:
df.loc[df['Unit'] == 'msecond', 'Value'] *= 1000
If you specifically need to restrict the update to a fixed step of indexes:
step = 2
start = 0
df.loc[df['Unit'].eq('msecond') & (df.index % step == start), 'Value'] *= 1000
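As a side note, the original loop most likely fails because df.iloc[i] returns a copy of the row (the frame has mixed dtypes), so the .at assignment never reaches the original frame. Here's a small self-contained sketch of the loc approach on the sample rows from the question:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2],
        "Name": ["K2", "K2", "K2", "K2"],
        "Metric": ["M1", "M2", "M1", "M2"],
        "Unit": ["msecond", None, "usecond", None],
        "Value": [1, 10, 500, 8],
    },
    index=[2, 3, 4, 5],
)

# Vectorized update: every row whose Unit is 'msecond' has its Value scaled by 1000
df.loc[df["Unit"] == "msecond", "Value"] *= 1000
print(df)  # the first row's Value becomes 1000; all other rows are untouched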

Dataframe - Create new column based on a given column's previous & current row value

I am dealing with a dataframe which has several thousand rows and several columns. The column of interest is called customer_csate_score.
The data looks like below
customer_csate_score
0.000
-0.4
0
0.578
0.418
-0.765
0.89
What I'm trying to do is create a new column in the dataframe called customer_csate_score_toggle_status which will have true if the value changed either from a positive value to a negative value or vice-versa. It will have value false if the polarity didn't reverse.
Expected Output for toggle status column
customer_csate_score_toggle_status
False
True
False
True
False
True
True
I've tried few different things but haven't been able to get this to work. Here's what I've tried -
Attempt - 1
def update_cust_csate_polarity(curr_val, prev_val):
    return True if (prev_val <= 0 and curr_val > 0) or (prev_val >= 0 and curr_val < 0) else False

data['customer_csate_score_toggle_status'] = data.apply(lambda x: update_cust_csate_polarity(data['customer_csate_score'], data['customer_csate_score'].shift()))
Attempt - 2
# Testing just one condition
data['customer_csate_score_toggle_status'] = data[(data['customer_csate_score'].shift() < 0) & (data['customer_csate_score']) > 0]
Could I please request help to get this right?
Calculate the sign change using np.sign(df.customer_csate_score)[lambda x: x != 0].diff() != 0:
Get the sign of values;
Filter out 0s so a sequence like 5 0 1 won't get marked incorrectly;
Check if the sign has changed using diff.
import numpy as np
df['customer_csate_score_toggle_status'] = np.sign(df.customer_csate_score)[lambda x: x != 0].diff() != 0
df['customer_csate_score_toggle_status'] = df['customer_csate_score_toggle_status'].fillna(False)
df
customer_csate_score customer_csate_score_toggle_status
0 0.000 False
1 -0.400 True
2 0.000 False
3 0.578 True
4 0.418 False
5 -0.765 True
6 0.890 True
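For reference, the same approach as a self-contained sketch using the sample scores from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"customer_csate_score": [0.000, -0.4, 0, 0.578, 0.418, -0.765, 0.89]}
)

# Sign of each non-zero score; zeros are dropped so they never count as a toggle
signs = np.sign(df["customer_csate_score"])[lambda x: x != 0]

# A sign change shows up as a non-zero diff; rows dropped above realign as NaN
df["customer_csate_score_toggle_status"] = signs.diff() != 0
df["customer_csate_score_toggle_status"] = df["customer_csate_score_toggle_status"].fillna(False)
print(df)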

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag in the same frame (a row-wise comparison for equality), and then return whether they match.
I want to check if the values match.
Maybe there is an easier approach to concatenate those columns into a single list then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or Numpy yet.
I'm trying to think of something efficient as well as I have a large data set to perform this on.
Most of this is pretty free thought so I am just trying lots of different methods
Some attempts so far using the iterate over each column method:
##Sample Data
df = pd.DataFrame([['123','456','789','123'],['357','125','234','863'],['168','298','573','298'], ['123','234','573','902']])
df = df.rename(columns = {3: 'Flag'})
##Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then flag the row if any of the columns match using .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
     0    1    2 Flag  indicator
0  123  456  789  123       True
1  357  125  234  863      False
2  168  298  573  298       True
3  123  234  573  902      False
Breaking the expression down, the element-wise comparison returns a boolean frame you could also use for boolean indexing:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
       0      1      2
0   True  False  False
1  False  False  False
2  False   True  False
3  False  False  False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
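Putting it together as a runnable sketch with the question's sample frame:
import pandas as pd

df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})

# Compare columns 0-2 against Flag row by row, then mark rows where any column matches
df['indicator'] = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
print(df)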

Gnuplot summing y values for same x values

I have a dataset which looks like this:
0 1 0.1
0 0 0.1
0 1 0.1
1 0 0.2
0 1 0.2
1 0 0.2
...
I now want to do the following operations on each different value in the third column of the table:
Example for 0.1:
First column values summed: 0+0+0=0
Second column values summed: 1+0+1=2
Now I want to subtract these two, 2 - 0 = 2, and in a last step divide by the number of occurrences:
2 / 3 = 0.667
The same for 0.2; my plot should then have a point at x=0.1, y=0.667.
I hope my problem is understandable with this example.
You can use the smooth unique option to do exactly this: sum up all y-values belonging to the same x-value and then divide the result by the number of occurrences. As the y-value, on which this operation is performed, use the difference between the second and first columns:
plot 'file.txt' using 3:($2 - $1) smooth unique
However, it seems you'll run into a strange bug here: this only works correctly if you insert an empty or commented row at the beginning of your data file.
The result with the following file.txt
#
0 1 0.1
0 0 0.1
0 1 0.1
1 0 0.2
0 1 0.2
1 0 0.2
is a plot with one averaged point per distinct x-value (image not reproduced here).
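As a cross-check of the numbers (not the plotting), here is a small pandas sketch of what smooth unique computes in this case, assuming the same whitespace-separated file.txt:
import pandas as pd

# Average of (col2 - col1) per distinct value of col3, mirroring `using 3:($2 - $1) smooth unique`
df = pd.read_csv("file.txt", sep=r"\s+", comment="#", header=None, names=["a", "b", "x"])
print((df["b"] - df["a"]).groupby(df["x"]).mean())
# x = 0.1 -> (1 + 0 + 1) / 3 =  0.667
# x = 0.2 -> (-1 + 1 - 1) / 3 = -0.333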

complex rounding decimals in excel spreadsheet

I have created a formula, for example =(E2+(G2*37))/290, which returns decimals based on the data entered, but I can't work out how to round the answer to get the following:
If 0 to 0.39 round to 0
If 0.4 to 0.89 round to 0.5
If 0.9 to 1.39 round to 1
If 1.4 to 1.89 round to 1.5
If 1.9 to 2.39 round to 2 etc
Hope my question makes sense. Thanks
Your custom rounding is simply rounding to the nearest 0.5, but with an "offset" of 0.15. With the value to be rounded in A1 you can do that with a simple formula, i.e.
=ROUND(A1*2-0.3,0)/2
or with your calculation in place of A1 that becomes
=ROUND((E2+G2*37)/290*2-0.3,0)/2
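As a rough cross-check outside Excel, here's a Python sketch of the same "nearest 0.5 with a 0.15 offset" rounding (custom_round is just an illustrative helper name; note Python's round() uses banker's rounding, so values landing exactly on .5 after the shift can differ from Excel's ROUND):
def custom_round(x):
    # Mirrors =ROUND(x*2 - 0.3, 0) / 2
    return round(x * 2 - 0.3) / 2

for v in (0.23, 0.45, 0.89, 1.58, 2.12):
    print(v, custom_round(v))
# 0.23 -> 0.0, 0.45 -> 0.5, 0.89 -> 0.5, 1.58 -> 1.5, 2.12 -> 2.0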
It's a bit convoluted but this should work:
=IF(A1 + 0.1 - ROUNDDOWN(A1+0.1;0) < 0.5; ROUNDDOWN(A1+0.1;0); ROUNDDOWN(A1+0.1;0) + 0.5)
where A1 is the cell you want to round.
N.B.
This works only for positive numbers.
Ranges are actually:
[0, 0.39999999...] [0.4 , 0.8999999...] ...
or equally:
[0, 0.4) [0.4 , 0.9) ...
You could define a VBA function to do this for you:
Public Function CustomRound(number As Double)
    Dim result As Double, fraction As Double
    fraction = number - Int(number)
    If number <= 0.39 Then
        result = 0
    ElseIf fraction >= 0.4 And fraction <= 0.9 Then
        result = Int(number) + 0.5
    Else
        result = Round(number, 0)
    End If
    CustomRound = result
End Function
You would call this as follows:
=CustomRound((E2+(G2*37))/290)
Example:
=CustomRound(0.23) // 0
=CustomRound(1.58) //1.5
=CustomRound(2.12) //2
