Get row with a symbol after a particular index - python-3.x

I have a df:
Index col1
1 Abc
2 xyz
3 $123
4 wer
5 exr
6 ert
7 $546
8 $456
Problem Statement:
Now I want to find the index of the row containing the dollar sign after the keyword wer.
My Code:
idx = df.col1.str.contains(r'\$').idxmax()  # this gives me index 3, but what I want is index 7
I need help modifying my code to get the desired output.

You also need a mask for the rows from 'wer' onward:
s = (df['col1'].str.contains(r'\$') # rows containing $
& df['col1'].eq('wer').cumsum().gt(0) # rows from the first 'wer' onward
).idxmax()
# s == 7

Use:
#df=df.set_index('Index') #if 'Index' is a column
df2=df[df['col1'].eq('wer').cumsum()>0]
df2['col1'].str.contains(r'\$').idxmax()
or:
df[(df['col1'].eq('wer').cumsum()>0) & df['col1'].str.contains(r'\$')].index[0]
Output:
7
Details:
df['col1'].eq('wer').cumsum().gt(0)
Index
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
Name: col1, dtype: bool
print(df2)
col1
Index
4 wer
5 exr
6 ert
7 $546
8 $456
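One caveat with idxmax on a boolean mask: it returns the first label even when no row matches. A minimal, self-contained sketch that rebuilds the sample frame and returns None in that case (the names after_wer, has_dollar and first_idx are illustrative, not from the answers above):

```python
import pandas as pd

# Rebuild the sample frame with "Index" as the index.
df = pd.DataFrame(
    {"col1": ["Abc", "xyz", "$123", "wer", "exr", "ert", "$546", "$456"]},
    index=pd.Index(range(1, 9), name="Index"),
)

# True from the first 'wer' row onward (inclusive).
after_wer = df["col1"].eq("wer").cumsum().gt(0)

# regex=False treats "$" as a literal character, so no escaping is needed.
has_dollar = df["col1"].str.contains("$", regex=False)

# Keep only rows satisfying both conditions and take the first label, if any.
matches = df.loc[after_wer & has_dollar].index
first_idx = matches[0] if len(matches) else None

print(first_idx)  # 7
```

Selecting with .loc and checking the length avoids silently getting the first label back when 'wer' or '$' never occurs.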

Related

Compare nth letter in one column to a single letter in another

I have a df as follows:
Policy Letter Password Lower Upper Count Lower_Minus_1 Upper_Minus_1
0 4-5 l rllllj 4 5 4 3 4
1 4-10 s ssskssphrlpscsxrfsr 4 10 8 3 9
2 14-18 p ppppppppppppppppppp 14 18 19 13 17
3 1-6 z zzlzvmqbzzclrz 1 6 6 0 5
4 4-5 j jhjjhxhjkxj 4 5 5 3 4
Lower_Minus_1 value is to be used as an index to search that position in the password to see if it matches the letter in column 'Letter'.
This line works:
print(df['Password'].str[3] == df['Letter'])
However, it returns True/False based on the character at index 3 of 'Password' for every single row.
First five:
0 True
1 False
2 True
3 True
4 True
I don't want the same fixed position for every row. I want the Lower_Minus_1 position for each row.
I have tried the following but both fail:
print(df['Password'].str[df['Letter']] == df['Letter'])
Returns False for every single row as proven by:
print((df['Password'].str[df['Letter']] == df['Letter']).sum())
Returns: 0
Then I tried this:
print(df.apply(lambda x: x['Password'].str[x['Lower_Minus_1']], axis=1) == df['Letter'])
This throws an error:
File "D:/AofC/2020_day2.py", line 56, in <lambda>
print(df.apply(lambda x: x['Password'].str[x['Lower_Minus_1']], axis=1) == df['Letter'])
AttributeError: 'str' object has no attribute 'str'
df.apply(lambda x:x['Letter']== x['Password'][x.Lower_Minus_1], axis=1)
0 True
1 False
2 True
3 True
4 True
dtype: bool
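The apply attempt in the question fails because, with axis=1, x['Password'] is a plain Python str for each row, and a str has no .str accessor; ordinary indexing works instead. As a minimal sketch, the same row-wise comparison can also be written with a list comprehension. The frame below is rebuilt from the sample rows, keeping only the columns involved, and the name matches is illustrative:

```python
import pandas as pd

# Only the three columns needed for the comparison, taken from the sample rows.
df = pd.DataFrame({
    "Letter": ["l", "s", "p", "z", "j"],
    "Password": [
        "rllllj",
        "ssskssphrlpscsxrfsr",
        "p" * 19,
        "zzlzvmqbzzclrz",
        "jhjjhxhjkxj",
    ],
    "Lower_Minus_1": [3, 3, 13, 0, 3],
})

# For each row, compare the character at that row's own position with Letter.
matches = pd.Series(
    [
        password[pos] == letter
        for password, pos, letter in zip(
            df["Password"], df["Lower_Minus_1"], df["Letter"]
        )
    ],
    index=df.index,
)

print(matches.tolist())  # [True, False, True, True, True]
```

This assumes every position is within the length of its password; an out-of-range position would raise IndexError.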

How can I transform this dataset in pandas so that it easy to filter and compare?

I have the following DataFrame:
Segments Airline_pct_tesco Airline_pct_asda food_pct_tesco food_pct_asda Airline_diff food_diff
A 1 2 4 2 -1 2
B 2 2 4 4 0 0
c 10 5 12 10 5 2
I want to convert it to this format:
Segments Category Asda% Tesco% Diff%
A Airline 2 1 -1
b Food 4 4 0
c Airline 5 10 5
A Food 2 4 2
(Only part of the output is shown.) Note that Category is the column name without the '_pct_tesco', '_pct_asda' or '_diff' suffix.
I am unsure how to go about this. I have tried transform, but I don't know how to get the result into a form that is easy for any user to work with. I am doing this in pandas and am not sure how to even begin. The Asda% values come from the '_pct_asda' columns, and likewise Diff% and Tesco% come from the '_diff' and '_pct_tesco' columns respectively.
Use set_index to move Segments out of the columns, then build a MultiIndex with MultiIndex.from_frame, using str.extract on the column names to split each one into the part before the suffix and the suffix itself (taken from a list of allowed suffixes). Finally, stack the Category level to get long form.
new_df = df.set_index('Segments')
# Define allowed suffixes here
suffixes = ['_pct_asda', '_pct_tesco', '_diff']
# Extract Values
new_df.columns = (
pd.MultiIndex.from_frame(
new_df.columns.str.extract(rf'(.*?)({"|".join(suffixes)})'),
names=['Category', None]
)
)
new_df = new_df.stack(0)
new_df:
_diff _pct_asda _pct_tesco
Segments Category
A Airline -1 2 1
food 2 2 4
B Airline 0 2 2
food 0 4 4
c Airline 5 5 10
food 2 10 12
To get cleaner output add reset_index + rename to fix column names and index and also re-order columns.
new_df = new_df.reset_index().rename(columns={
'_pct_asda': 'Asda%',
'_pct_tesco': 'Tesco%',
'_diff': 'Diff%'
})[['Segments', 'Category', 'Asda%', 'Tesco%', 'Diff%']]
new_df:
Segments Category Asda% Tesco% Diff%
0 A Airline 2 1 -1
1 A food 2 4 2
2 B Airline 2 2 0
3 B food 4 4 0
4 c Airline 5 10 5
5 c food 10 12 2
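A different route to the same table, sketched here as an alternative and not as the method used above, is to melt to long form, split the column names with a regular expression, and pivot the measures back into columns. The names long_df, parts, Measure and tidy are illustrative, and pivot with a list of index columns assumes pandas 1.1 or later:

```python
import pandas as pd

df = pd.DataFrame({
    "Segments": ["A", "B", "c"],
    "Airline_pct_tesco": [1, 2, 10],
    "Airline_pct_asda": [2, 2, 5],
    "food_pct_tesco": [4, 4, 12],
    "food_pct_asda": [2, 4, 10],
    "Airline_diff": [-1, 0, 5],
    "food_diff": [2, 0, 2],
})

# One row per (Segments, original column name).
long_df = df.melt(id_vars="Segments", var_name="column", value_name="value")

# Split "Airline_pct_asda" into Category="Airline" and Measure="pct_asda".
parts = long_df["column"].str.extract(
    r"^(?P<Category>.*?)_(?P<Measure>pct_asda|pct_tesco|diff)$"
)
long_df = pd.concat([long_df[["Segments", "value"]], parts], axis=1)

# Spread the three measures back into columns, then tidy names and order.
tidy = (
    long_df.pivot(index=["Segments", "Category"], columns="Measure", values="value")
    .rename(columns={"pct_asda": "Asda%", "pct_tesco": "Tesco%", "diff": "Diff%"})
    .reset_index()
    .rename_axis(columns=None)
    [["Segments", "Category", "Asda%", "Tesco%", "Diff%"]]
)

print(tidy)
```

With the sample data this gives six rows, one per Segments/Category pair, in the same column order as the output shown above.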

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A", "A", "B", "B","B"], 'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1", "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"], 'VAL':[1,3,8,5,8,10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as the sum of 'VAL' over rows whose GROUP name is the same except for an H character at the end. For example, 'VAL' for the first two rows will be added because the only difference between their 'GROUP' values is that the 2nd row ends in H. Row 3 stays as it is, rows 4 and 6 are added together, and row 5 stays the same.
My expected output
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try with replace, then transform. Anchoring the pattern with $ strips only a trailing H:
df.groupby(df.GROUP.str.replace('H$', '', regex=True)).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H$', '', regex=True)).VAL.transform('sum')
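Removing every 'H' with str.replace('H', '') would also merge groups that differ by an H elsewhere in the name, whereas the question only asks to ignore an H at the end. A minimal sketch of the difference, using hypothetical group names that are not from the question:

```python
import pandas as pd

# Hypothetical names: "XH_1" has an interior H, "X_1H" has a trailing H.
demo = pd.DataFrame({
    "GROUP": ["XH_1", "X_1", "X_1H"],
    "VAL": [1, 2, 4],
})

# Strips every H: all three names collapse to "X_1".
every_h = demo["GROUP"].str.replace("H", "", regex=False)

# Strips only an H at the end of the string: "XH_1" stays distinct.
trailing_h = demo["GROUP"].str.replace(r"H$", "", regex=True)

demo["CAL_every_h"] = demo.groupby(every_h)["VAL"].transform("sum")
demo["CAL_trailing_h"] = demo.groupby(trailing_h)["VAL"].transform("sum")

print(demo)
```

On the sample data in the question both variants give the same CAL values, because none of those group names contains an H anywhere but the end.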

Highest frequency in a dataframe

I am looking for a way to get the most frequent value in an entire pandas DataFrame, not in a particular column. I have looked at value_counts, but it seems to work on one column at a time. Is there any way to do that?
Use DataFrame.stack with Series.mode to get the most frequent values, then select the first one by position:
df = pd.DataFrame({
'B':[4,5,4,5,4,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
})
a = df.stack().mode().iat[0]
print (a)
4
Or, if you also need the frequency, use Series.value_counts:
s = df.stack().value_counts()
print (s)
4 6
5 4
3 3
9 2
7 2
2 2
1 2
8 1
6 1
0 1
dtype: int64
print (s.index[0])
4
print (s.iat[0])
6
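If several values can share the highest frequency, taking only the first entry hides the tie. A small sketch, reusing the sample frame, that collects every value with the top count (the names counts, top_count and top_values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "B": [4, 5, 4, 5, 4, 4],
    "C": [7, 8, 9, 4, 2, 3],
    "D": [1, 3, 5, 7, 1, 0],
    "E": [5, 3, 6, 9, 2, 4],
})

# Count every value across all columns.
counts = df.stack().value_counts()

# Highest frequency, and all values that reach it.
top_count = counts.max()
top_values = counts[counts == top_count].index.tolist()

print(top_values, top_count)  # [4] 6
```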

How to extract value of column based on value change in other column python

I have a DataFrame with two columns. I want to extract the value of the first column based on the second column: if, within the last 3 rows, the value in column 2 changes from 0 to any other value, then extract the value of column 1.
df=pd.DataFrame({'column1':[1,5,6,7,8,11,12,14,18,20],'column2':[0,0,1,1,0,0,0,256,256,0]})
print(df)
column1 column2
0 1 0
1 5 0
2 6 1
3 7 1
4 8 0
5 11 0
6 12 0
7 14 256
8 18 256
9 20 0
out_put=pd.DataFrame({'column1':[20],'column2':[0]})
print(out_put)
column1 column2
0 20 0
I believe you need to check the difference between consecutive values in the last 3 rows of the second column:
df1 = df.tail(3)
df2 = df1[df1['column2'].eq(0).view('i1').diff().eq(1)]
print (df2)
column1 column2
9 20 0
Details:
#last 3 rows
print (df1)
column1 column2
7 14 256
8 18 256
9 20 0
#compare second column for equality with 0
print (df1['column2'].eq(0))
7 False
8 False
9 True
Name: column2, dtype: bool
#convert mask to integers
print (df1['column2'].eq(0).view('i1'))
7 0
8 0
9 1
Name: column2, dtype: int8
#get difference
print (df1['column2'].eq(0).view('i1').diff())
7 NaN
8 0.0
9 1.0
Name: column2, dtype: float64
#compare by 1
print (df1['column2'].eq(0).view('i1').diff().eq(1))
7 False
8 False
9 True
Name: column2, dtype: bool
Finally, filter by boolean indexing.
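Series.view is deprecated in recent pandas releases, so the integer conversion may need astype there. A minimal sketch of the same steps with astype('int8'), under the assumption that only the dtype conversion needs to change (the names last3, is_zero, became_zero and result are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [1, 5, 6, 7, 8, 11, 12, 14, 18, 20],
    "column2": [0, 0, 1, 1, 0, 0, 0, 256, 256, 0],
})

# Work only on the last 3 rows.
last3 = df.tail(3)

# 1 where column2 is 0, else 0; astype replaces the deprecated view('i1').
is_zero = last3["column2"].eq(0).astype("int8")

# diff() is 1 exactly where a row is 0 and the previous row was not.
became_zero = is_zero.diff().eq(1)

result = last3[became_zero]
print(result)
```

As in the steps above, this flags the row where column2 becomes 0 after a non-zero value, which matches the expected output in the question.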
