Dataframe - Create new column based on a given column's previous & current row value - python-3.x

I am dealing with a dataframe which has several thousand rows and several columns. The column of interest is called customer_csate_score.
The data looks like below
customer_csate_score
0.000
-0.4
0
0.578
0.418
-0.765
0.89
What I'm trying to do is create a new column in the dataframe called customer_csate_score_toggle_status which will be True if the value changed from positive to negative or vice versa, and False if the polarity didn't reverse.
Expected Output for toggle status column
customer_csate_score_toggle_status
False
True
False
True
False
True
True
I've tried a few different things but haven't been able to get this to work. Here's what I've tried -
Attempt - 1
def update_cust_csate_polarity(curr_val, prev_val):
    return True if (prev_val <= 0 and curr_val > 0) or (prev_val >= 0 and curr_val < 0) else False
data['customer_csate_score_toggle_status'] = data.apply(lambda x: update_cust_csate_polarity(data['customer_csate_score'], data['customer_csate_score'].shift()))
Attempt - 2
# Testing just one condition
data['customer_csate_score_toggle_status'] = data[(data['customer_csate_score'].shift() < 0) & (data['customer_csate_score']) > 0]
Could I please request help to get this right?

Calculate the sign change using np.sign(df.customer_csate_score)[lambda x: x != 0].diff() != 0:
Get the sign of values;
Filter out 0s so sequence like 5 0 1 won't get marked incorrectly;
Check if the sign has changed using diff.
import numpy as np
df['customer_csate_score_toggle_status'] = np.sign(df.customer_csate_score)[lambda x: x != 0].diff() != 0
df['customer_csate_score_toggle_status'] = df['customer_csate_score_toggle_status'].fillna(False)
df
customer_csate_score customer_csate_score_toggle_status
0 0.000 False
1 -0.400 True
2 0.000 False
3 0.578 True
4 0.418 False
5 -0.765 True
6 0.890 True
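For reference, the whole approach can be run end-to-end on the scores from the question (a self-contained sketch; the intermediate names signs and nonzero are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"customer_csate_score": [0.000, -0.4, 0, 0.578, 0.418, -0.765, 0.89]})

signs = np.sign(df["customer_csate_score"])   # -1, 0 or 1 per row
nonzero = signs[signs != 0]                   # drop 0s so a run like 5, 0, 1 is not marked as a toggle
df["customer_csate_score_toggle_status"] = (nonzero.diff() != 0)
df["customer_csate_score_toggle_status"] = df["customer_csate_score_toggle_status"].fillna(False)

print(df["customer_csate_score_toggle_status"].tolist())
# [False, True, False, True, False, True, True]
```

Rows where the score is exactly 0 drop out of the filtered series, come back as NaN after alignment, and are filled with False.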

Vectorized way to find if 1 value in row from list of columns is greater than threshold

I am new to using np.where(), so of course I tried to use my new toy for this problem. But it does not work.
I have a dataframe.
Close M N O P
0.1 0.2 0.3 0.4 0.5
0.2 0.1 0.6 0.1 0.0
Colslist = [M,N,O,P]
I want a new column called Q with the result of this formula. The formula I thought up is:
df["Q"] = np.where(df[Colslist] >= (Close + 0.3), 1,0)
The expected output is a 1 in both rows: row 0 has two values that qualify, and row 1 has one.
I believe the problem in what I wrote is it requires all values to be greater than to output a 1.
So what I need is a 1 in the Q column if there is a single value in that row, across the list of columns, that is greater than or equal to Close of the same row + 0.3.
What's the best vectorized way to do this?
The problem is that the axes don't match in your condition. The output of
df[Colslist] >= (df['Close'] + 0.3)
is
M N O P 0 1
0 False False False False False False
1 False False False False False False
which doesn't make sense.
You could use the ge method to make sure that you're comparing values in Colslist with the "Close" column values. So the result of:
df[Colslist].ge(df['Close'] + 0.3, axis=0)
is
M N O P
0 False False True True
1 False True False False
Now, since your condition is that it is True if there is at least one True in a row, you can use any on axis=1. Then the final code:
Colslist = ['M','N','O','P']
df["Q"] = np.where(df[Colslist].ge(df['Close'] + 0.3, axis=0).any(axis=1), 1,0)
Output:
Close M N O P Q
0 0.1 0.2 0.3 0.4 0.5 1
1 0.2 0.1 0.6 0.1 0.0 1
Here's a one-liner with pure Pandas so you don't need to convert between NumPy and Pandas (might help with speed if your dataframe is really long):
df["Q"] = (df.max(axis=1) >= df["Close"] + 0.3).astype(int)
It finds the max value in each row and sees if it's larger than the required value. If the max value isn't large enough, then no value in the row will be. It takes advantage of the fact that you don't actually need to count the number of elements in a row that are greater than df["Close"] + 0.3; you just need to know if at least 1 element meets the condition.
Then it converts the True and False answers to 1 and 0 using astype(int).
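Both variants can be sanity-checked side by side on the sample frame. This sketch uses df[Colslist].max(axis=1) rather than df.max(axis=1) so that only the candidate columns enter the max, which is safer if the frame ever gains extra numeric columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Close": [0.1, 0.2],
                   "M": [0.2, 0.1], "N": [0.3, 0.6],
                   "O": [0.4, 0.1], "P": [0.5, 0.0]})
Colslist = ["M", "N", "O", "P"]

# Row-wise comparison against Close + 0.3, then "at least one True" per row
any_version = np.where(df[Colslist].ge(df["Close"] + 0.3, axis=0).any(axis=1), 1, 0)

# Max-based shortcut: if the row maximum doesn't clear the bar, nothing in the row does
max_version = (df[Colslist].max(axis=1) >= df["Close"] + 0.3).astype(int)

print(any_version.tolist(), max_version.tolist())
# [1, 1] [1, 1]
```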

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent inclusive ranges: for example, "row 4" contains the following numbers:
2409, 2410, 2411, 2412, 2413, 2414, 2377, 2378, 1478, 1479, 1480, 1481, 1482, 1483, 1484. And in "row 0" and "row 1" the range for one of the sets is backwards (1931, 1930, 1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, since sometimes I will need to query more than just two numbers. Using Python 3.8.
Example Dataframe
import pandas as pd

x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
     '1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
     '2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
     '2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
     '2409:2414,2377:2378,1478:1484',
     '2474:2476',
     ]
y = [6.48, 7.02, 7.02, 6.55, 5.99, 6.39]
df = pd.DataFrame(list(zip(x, y)), columns=['ids', 'val'])
display(df)
Desired Output Dataframe
I would write a function that performs 2 steps:
Given the ids_string that contains the range of ids, list all the ids as ids_num_list
Check if the query_id is in the ids_num_list
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to a set of individual ids
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # sort numerically, not lexicographically, so reversed ranges like 1931:1928
            # and ranges whose bounds have different digit counts (e.g. 99:100) expand correctly
            lower, upper = sorted(int(v) for v in ids_range.split(':'))
            ids_num_list.update(range(lower, upper + 1))
        else:
            ids_num_list.add(int(ids_range))
    # Check if the query number is in the set
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0

# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x: check_num_in_ids_string(x, query_id))
    )
which returns what you require:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0
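The range-expansion step can be exercised on its own; note that the parts are converted to int before sorting, so a reversed range like '2341:2339' still expands correctly (a minimal sketch of the idea above, using a made-up string):

```python
# Expand one comma-separated ids string into a set of individual ids
ids_string = '2339:2341,1142:1143,1310'
nums = set()
for part in ids_string.split(','):
    if ':' in part:
        lo, hi = sorted(int(v) for v in part.split(':'))   # numeric sort handles reversed ranges
        nums.update(range(lo, hi + 1))
    else:
        nums.add(int(part))

print(sorted(nums))
# [1142, 1143, 1310, 2339, 2340, 2341]
```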

In Python Pandas, how do I combine two columns containing strings using if/else statement or similar?

I have created a pandas dataframe from an excel file where first two columns are:
df = pd.DataFrame({'0':['','','Location Code','pH','Ag','Alkalinity'], '1':['Lab Id','Collection Date','','','µg/L','mg/L']})
which looks like this:
df[0] df[1]
Lab Id
Collection Date
Location Code
pH
Ag µg/L
Alkalinity mg/L
I want to merge these columns into one that looks like this:
df[0]
Lab Id
Collection Date
Location Code
pH
Ag (µg/L)
Alkalinity (mg/L)
I believe I need a control statement before combining df[0] and df[1] which would appear like this:
if **there is a blank space in either column, then it performs**:
df[0] = df[0].astype(str)+df[1].astype(str)
else:
df[0] = df[0].astype(str)+' ('+df[1].astype(str)+')'
but I am not sure how to write the if statement. Could anyone please guide me here?
Thank you very much.
We can try np.select
cond = [(df['0'] == '') & (df['1'] != ''),
        (df['0'] != '') & (df['1'] == ''),
        (df['0'] != '') & (df['1'] != '')]
val = [df['1'], df['0'], df['0'] + '(' + df['1'] + ')']
df['new'] = np.select(cond, val)
df
0 1 new
0 Lab Id Lab Id
1 Collection Date Collection Date
2 Location Code Location Code
3 pH pH
4 Ag µg/L Ag(µg/L)
5 Alkalinity mg/L Alkalinity(mg/L)
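End-to-end, the np.select route looks like this on the sample frame (a self-contained sketch; a space is added before the parenthesis here to match the format asked for in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'0': ['', '', 'Location Code', 'pH', 'Ag', 'Alkalinity'],
                   '1': ['Lab Id', 'Collection Date', '', '', 'µg/L', 'mg/L']})

cond = [(df['0'] == '') & (df['1'] != ''),   # only column 1 filled
        (df['0'] != '') & (df['1'] == ''),   # only column 0 filled
        (df['0'] != '') & (df['1'] != '')]   # both filled -> append the unit in parentheses
val = [df['1'], df['0'], df['0'] + ' (' + df['1'] + ')']
df['new'] = np.select(cond, val)

print(df['new'].tolist())
# ['Lab Id', 'Collection Date', 'Location Code', 'pH', 'Ag (µg/L)', 'Alkalinity (mg/L)']
```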
If the value is NaN, maybe:
df['result'] = df[0].fillna(df[1])
This works using numpy.where; the string-concatenation assumption is based on the data shared:
df.assign(
merger=np.where(
df["1"].str.endswith("/L"),
df["0"].str.cat(df["1"], "(").add(")"),
df["0"].str.cat(df["1"], ""),
)
)
0 1 merger
0 Lab Id Lab Id
1 Collection Date Collection Date
2 Location Code Location Code
3 pH pH
4 Ag µg/L Ag(µg/L)
5 Alkalinity mg/L Alkalinity(mg/L)
Or, you could just assign it to "0", if that is what you are after:
df["0"] = np.where(
df["1"].str.endswith("/L"),
df["0"].str.cat(df["1"], "(").add(")"),
df["0"].str.cat(df["1"], ""),
)
Here is another way:
First, wrap the values you are going to concatenate in parentheses (using .loc to avoid chained assignment):
df.loc[df.replace('', np.nan).notnull().all(axis=1), '1'] = '(' + df['1'] + ')'
Now we fill in the missing values with bfill and ffill:
df = df.replace('', np.nan).bfill(axis=1).ffill(axis=1)
The only thing remaining is to merge the values wherever we have brackets:
df.loc[:, 'merge'] = np.where(df['1'].str.endswith(')'), df['0'] + df['1'], df['1'])
Test whether there is an empty value in at least one of columns 0 and 1 with DataFrame.eq and DataFrame.any, and then join both columns like in your answer inside numpy.where:
df = pd.DataFrame({0: ['', '', 'Location Code', 'pH', 'Ag', 'Alkalinity'],
                   1: ['Lab Id', 'Collection Date', '', '', u'µg/L', 'mg/L']})
print (df[[0,1]].eq(''))
0 1
0 True False
1 True False
2 False True
3 False True
4 False False
5 False False
print (df[[0,1]].eq('').any(axis=1))
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
df[0] = np.where(df[[0,1]].eq('').any(axis=1),
                 df[0].astype(str) + df[1].astype(str),
                 df[0].astype(str) + ' (' + df[1].astype(str) + ')')
print (df)
0 1
0 Lab Id Lab Id
1 Collection Date Collection Date
2 Location Code
3 pH
4 Ag (µg/L) µg/L
5 Alkalinity (mg/L) mg/L

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag (a row-wise equality comparison) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or NumPy yet.
I'm trying to think of something efficient as well as I have a large data set to perform this on.
Most of this is pretty free thought so I am just trying lots of different methods
Some attempts so far using the iterate over each column method:
## Sample data
df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})

## Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then return the row if any of the columns match using .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
0 1 2 Flag indicator
0 123 456 789 123 True
1 357 125 234 863 False
2 168 298 573 298 True
3 123 234 573 902 False
You can select from the result you get back by boolean indexing:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
0 1 2
0 True False False
1 False False False
2 False True False
3 False False False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
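To actually pull out the matching rows, the mask can be fed straight back into the frame (a small self-contained sketch using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})

# True where any of columns 0-2 equals Flag in the same row
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
print(df[m].index.tolist())   # rows 0 and 2 contain their Flag value
# [0, 2]
```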

selecting different columns each row

I have a dataframe which has 500K rows, seven columns for days, and startDay and endDay columns.
I search for a value (like 0) in the range (startDay, endDay).
For example, for id_1, startDay=1 and endDay=7, so I should look for the value in columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should look in columns D4 to D7.
However, I couldn't get the search to work over a different column range for each row.
As mentioned above:
if startDay > endDay, I should see -999
else, I need to find the first zero (within the day range); for example, id_3's first zero is in column D2 (day 2) and id_3's startDay is 1, so I want to see 2 - 1 = 1 (D2 - startDay)
if I cannot find a 0, I want to see 8
Here is my data;
data = {
    'D1': [0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    'D2': [2, 0, 0, 1, 2, 2, 1, 2, 0, 4],
    'D3': [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    'D4': [3, 3, 3, 1, 3, 2, 3, 0, 3, 3],
    'D5': [0, 0, 3, 3, 4, 0, 4, 2, 3, 1],
    'D6': [2, 1, 1, 0, 3, 2, 1, 2, 2, 1],
    'D7': [2, 3, 0, 0, 3, 1, 3, 2, 1, 3],
    'startDay': [1, 4, 1, 1, 3, 3, 2, 2, 5, 2],
    'endDay': [7, 7, 6, 7, 7, 7, 2, 1, 7, 6]
}
data_idx = ['id_1','id_2','id_3','id_4','id_5',
'id_6','id_7','id_8','id_9','id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see;
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can create a boolean array to check, in each row, which 'Dx' columns are at or after 'startDay', at or before 'endDay', and hold a value equal to 0. For the first two conditions, you can use np.ufunc.outer with the ufunc being np.less_equal and np.greater_equal, such as:
import numpy as np
arr_bool = ( np.less_equal.outer(df.startDay, range(1, 8))     # which columns Dx are at or after startDay
           & np.greater_equal.outer(df.endDay, range(1, 8))    # which columns Dx are at or before endDay
           & (df.filter(regex='D[0-9]').values == 0))          # which values in columns Dx are 0
Then you can use np.argmax to find the first True per row. By adding 1 and subtracting 'startDay', you get the values you are looking for. Then you need to handle the other conditions with np.select, replacing values with -999 if df.startDay >= df.endDay, or with 8 if there is no True in the row of arr_bool, such as:
df_need = pd.DataFrame((np.argmax(arr_bool, axis=1) + 1 - df.startDay).values,
                       index=data_idx, columns=['need'])
df_need.need = np.select(condlist=[df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                         choicelist=[-999, 8],
                         default=df_need.need)
print(df_need)
need
id_1 0
id_2 1
id_3 1
id_4 0
id_5 8
id_6 2
id_7 -999
id_8 -999
id_9 8
id_10 1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select rather than the strict df.startDay > df.endDay from your question; if you change it to the strict comparison, you get 8 instead of -999 in that case.
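The np.less_equal.outer building block used above can also be looked at in isolation; it compares every row's startDay against every day number at once (a minimal sketch with made-up values):

```python
import numpy as np

start = np.array([1, 4, 2])                       # hypothetical startDay per row
mask = np.less_equal.outer(start, range(1, 8))    # mask[i, j] == (start[i] <= j + 1)
print(mask.astype(int))
# [[1 1 1 1 1 1 1]
#  [0 0 0 1 1 1 1]
#  [0 1 1 1 1 1 1]]
```

Each row of the mask marks the day columns that lie at or after that row's start day; ANDing it with the endDay mask and the zero-value mask gives arr_bool.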
