How to create a column based on string positions of another column in Python? - python-3.x

df:
Col_A
0 011011
1 000111
2 011000
3 011111
4 011010
How can I create a column based on string positions? For each value in Col_A I need to find the positions where the character is '0' and list them in a new column Col_B.
Output:
Col_A Col_B
0 011011 pos1,pos4
1 000111 pos1,pos2,pos3
2 011000 pos1,pos4,pos5,pos6
3 011111 pos1
4 011010 pos1,pos4,pos6

First convert the strings to a DataFrame and add column names with a function passed to rename:
f = lambda x: f'pos{x+1}'
df1 = pd.DataFrame([list(x) for x in df['Col_A']], index=df.index).rename(columns=f)
print (df1)
pos1 pos2 pos3 pos4 pos5 pos6
0 0 1 1 0 1 1
1 0 0 0 1 1 1
2 0 1 1 0 0 0
3 0 1 1 1 1 1
4 0 1 1 0 1 0
Then compare against '0' with DataFrame.eq and, for the new column, use matrix multiplication via DataFrame.dot, removing the trailing separator with Series.str.rstrip:
df['Col_B'] = df1.eq('0').dot(df1.columns + ',').str.rstrip(',')
print (df)
Col_A Col_B
0 011011 pos1,pos4
1 000111 pos1,pos2,pos3
2 011000 pos1,pos4,pos5,pos6
3 011111 pos1
4 011010 pos1,pos4,pos6
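
Putting the two steps together, a self-contained sketch you can paste and run (it constructs the sample frame from the question first):
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Col_A': ['011011', '000111', '011000', '011111', '011010']})

# one column per character position, named pos1, pos2, ...
f = lambda x: f'pos{x + 1}'
df1 = pd.DataFrame([list(x) for x in df['Col_A']], index=df.index).rename(columns=f)

# True where the character is '0'; the dot product with the column names joins them per row
df['Col_B'] = df1.eq('0').dot(df1.columns + ',').str.rstrip(',')
print(df)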

Related

How to count occurrences of a particular value in a given column corresponding to another column

To count the occurrences of each value in the given column per group, use pd.crosstab together with a row-wise sum:
In [236]: output = pd.crosstab(df['Rel_ID'], df['Values'])
In [238]: output['total'] = output.sum(axis=1)
In [239]: output
Out[239]:
Values 400.0 500.0 1700.0 6300.0 total
Rel_ID
TESTA 1 1 1 1 4
TESTB 1 0 1 1 3
TESTC 0 1 1 0 2
TESTD 1 0 1 1 3
TESTE 1 1 0 0 2
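
The input frame is not shown in the answer; a minimal reproduction (the data below is made up only to match the crosstab output above) might look like this:
import pandas as pd

# hypothetical long-format input: one row per (Rel_ID, Values) occurrence
df = pd.DataFrame({
    'Rel_ID': ['TESTA', 'TESTA', 'TESTA', 'TESTA',
               'TESTB', 'TESTB', 'TESTB',
               'TESTC', 'TESTC',
               'TESTD', 'TESTD', 'TESTD',
               'TESTE', 'TESTE'],
    'Values': [400.0, 500.0, 1700.0, 6300.0,
               400.0, 1700.0, 6300.0,
               500.0, 1700.0,
               400.0, 1700.0, 6300.0,
               400.0, 500.0],
})

output = pd.crosstab(df['Rel_ID'], df['Values'])
output['total'] = output.sum(axis=1)   # total count per Rel_ID
print(output)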

How to identify where a particular sequence in a row occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 0
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 0 0 1 0 0 1
E 1 1 1 1 1 1
As you can see, 1s and 0s occur in different columns. I would like Python code that finds the column number where the pattern 1 0 0 occurs for the first time. For example, for person A the first 1 0 0 pattern starts at appear_1, so the first occurrence is 1. Similarly, for person B the first 1 0 0 pattern starts at appear_2, so the first occurrence is 2. The resulting table should have a new column named 'first_occurrence'. If no 1 0 0 pattern occurs at all (as in row E), the value in the first_occurrence column should be the number of 1s in that row. The resulting table should look something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
A 1 0 0 1 0 0 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 4
D 0 0 1 0 0 1 3
E 1 1 1 1 1 1 6
Thank you in advance.
I try not to reinvent the wheel, so I build on my answer to a previous question. On top of that answer, you additionally need idxmax, np.where, and Index.get_indexer:
import numpy as np

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
m = df1[df1.eq(1)].ffill(axis=1).notna()           # mask out everything before the first 1
df2 = df1[m].bfill(axis=1).eq(0)                   # mark the 0s inside that region
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)  # a 0 followed by another 0 (a trailing 0 also counts)
df['first_occurrence'] = np.where(m2.any(axis=1),
                                  df1.columns.get_indexer(m2.idxmax(axis=1)),
                                  df1.shape[1])
Out[540]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
0 A 1 0 0 1 0 0 1
1 B 1 1 0 0 1 0 2
2 C 1 0 1 1 0 0 4
3 D 0 0 1 0 0 1 3
4 E 1 1 1 1 1 1 6
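
An alternative sketch (not the original answer's approach, but it reproduces the sample output; check it against your own edge cases): join each row into a string of digits and look for the literal substring '100', falling back to the count of 1s when it is absent.
import numpy as np

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
s = df[cols].astype(str).apply(''.join, axis=1)    # e.g. '100100' for person A
pos = s.str.find('100')                            # -1 when the pattern is absent
df['first_occurrence'] = np.where(pos.ge(0), pos + 1, df[cols].sum(axis=1))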

List column name having value greater than zero

I have following dataframe
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that lists, for each row, every column whose value is greater than zero, together with that value:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1
We can use mask and stack:
s = (df.mask(df == 0).stack()      # long format: (row, column) -> nonzero value
       .astype(int).astype(str)
       .reset_index(level=1)       # bring the column name back as a column
       .apply('-'.join, axis=1)    # 'A-1', 'C-2', ...
       .add(',')
       .groupby(level=0).sum()     # concatenate per original row
       .str[:-1])                  # drop the trailing comma
df['New'] = s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1
Combine the column names with the df values that are not zero, then filter out the None values:
import numpy as np

df = pd.read_clipboard()
arrays = np.where(df != 0, df.columns.values + '-' + df.values.astype('str'), None)
new = []
for array in arrays:
    new.append(list(filter(None, array)))
df['New'] = new
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
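
If a single comma-separated string is preferred over a list (as in the expected output), a row-wise comprehension is another option; a sketch assuming the same four columns:
cols = ['A', 'B', 'C', 'D']
df['New'] = df[cols].apply(
    lambda row: ', '.join(f'{col}-{val}' for col, val in row.items() if val > 0),
    axis=1,
)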

Comparing two different sized pandas Dataframes and to find the row index with equal values

I need some help comparing two pandas dataframes.
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the row indices of dataframe 1 (df1) where the entire row is the same as a row in dataframe 2 (df2). My expected result would be
0
3
4
6
The above indices do not need to be in order; all I want are the indices of dataframe 1 (df1).
Is there a way without using for loop?
Thanks
Tommy
You can use merge with indicator=True:
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] == 'both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
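
A loop-free alternative (just a sketch, assuming both frames share the same column order) is to compare whole rows as tuples:
mask = pd.Series(list(map(tuple, df1.values)), index=df1.index).isin(list(map(tuple, df2.values)))
df1[mask].index
Unlike the merge approach, which relies on merge producing a fresh RangeIndex aligned with df1's rows, this keeps df1's original index even when it is not a default RangeIndex.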

drop duplicate rows from dataframe based on column precedence - python

If I have a dataframe
Example:
Name A B C
0 Jon 0 1 0
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
If there is a duplicate name in my dataset, I want the row where column A is 1 to take precedence.
NB: column A can only have the values 1 or 0.
Output:
Name A B C
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
IIUC, sort the values before dropping duplicates:
df.sort_values('A').drop_duplicates('Name', keep='last').sort_index()
Out[126]:
Name A B C
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
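
An alternative sketch with the same result (relying on the stated guarantee that A is only 0 or 1): keep, for each name, the row where A is largest.
df.loc[df.groupby('Name')['A'].idxmax()].sort_index()
When all of a name's rows have the same A, idxmax keeps the first of them.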
