Search for value in all DataFrame columns (except first column !) and add new column with matching column name - python-3.x

I'd like to do a search on all columns (except the first column !) of a DataFrame and add a new column (like 'Column_Match') with the name of the matching column.
I tried something like this:
df.apply(lambda row: row.astype(str).str.contains('my_keyword').any(), axis=1)
But it's not excluding the first column and I don't know how to return and add the column name.
Any help much appreciated !

To get the name of the first matching column per row, use DataFrame.idxmax for the column name, and add a fallback column with DataFrame.assign for rows that have no match:
import pandas as pd
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'A':list('abcdef'),
'C':list('akabbe'),
'F':list('eakbbb')
})
f = lambda row: row.astype(str).str.contains('e')
df['new'] = df.iloc[:,1:].apply(f, axis=1).assign(missing=True).idxmax(axis=1)
print (df)
   B  A  C  F      new
0  4  a  a  e        F
1  5  b  k  a  missing
2  4  c  a  k  missing
3  5  d  b  b  missing
4  5  e  b  b        A
5  4  f  e  b        C
If you need the names of all matching columns, build a boolean DataFrame and take its dot product with the column names using DataFrame.dot, then strip the trailing separator with Series.str.rstrip:
f = lambda row: row.astype(str).str.contains('a')
df1 = df.iloc[:,1:].apply(f, axis=1)
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')
print (df)
   B  A  C  F      new
0  4  a  a  e     A, C
1  5  b  k  a        F
2  4  c  a  k        C
3  5  d  b  b  missing
4  5  e  b  b  missing
5  4  f  e  b  missing
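As a variation on the above, here is a minimal sketch that wraps the all-matches logic in a helper. The function name match_columns and its missing parameter are my own (hypothetical, not part of pandas), and it assumes a literal, non-regex substring search applied column by column rather than row by row:

```python
import pandas as pd

def match_columns(df, keyword, missing="missing"):
    """Hypothetical helper: names of all columns (except the first) containing keyword."""
    # Skip the first column and compare everything as strings.
    searched = df.iloc[:, 1:].astype(str)
    # One vectorized str.contains call per column (literal match, no regex).
    mask = searched.apply(lambda col: col.str.contains(keyword, regex=False))
    # For each row, join the names of the columns where the mask is True.
    names = mask.apply(lambda row: ", ".join(mask.columns[row.to_numpy()]), axis=1)
    # Rows without any match end up as an empty string; label them.
    return names.replace("", missing)

df = pd.DataFrame({
    "B": [4, 5, 4, 5, 5, 4],
    "A": list("abcdef"),
    "C": list("akabbe"),
    "F": list("eakbbb"),
})
df["Column_Match"] = match_columns(df, "a")
print(df)
```

Because the first column is sliced off before the search, a keyword that only appears in B is never reported.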

Related

Need help inputting data in excel columns from other columns

Suppose I have data in columns A to E, as in the table below. Some of the values can be NA. In column F, I want to pull data from columns A to E so that if A has a value it is used, otherwise B if it has a value, and so on through column E. If none of them has a value, return NA. I would like to automate this so that I specify the order somewhere, for example A, B, C, D, E or A, C, E, D, B, and the values in F update according to that reference order.
Reference : C - B - A - E - D
a
b
c
d
e
f
3
4
3
2
2
7
1
7
NA
1
4
2
4
2
2
4
2
2
Use FILTER() with the @ (implicit intersection) operator.
=@FILTER(A2:E2,A2:E2<>"","NA")
For a dynamic array approach (results spill automatically), try:
=BYROW(A2:E7,LAMBDA(x,INDEX(FILTER(x,x<>"","NA"),1,1)))

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, the row index 2 would be removed because row index 0 has the information in columns A and B, and X in column C
As this data is slightly large, I hope to avoid iterating over rows, if possible. Ignore Index is the closest thing I've found to the built-in drop_duplicates().
If there is no X in column C, rows should be deduplicated only when C is also identical.
When rows match on A and B and several of them contain an X in C, with different values of C, the following is expected.
df = pd.DataFrame(columns=["A","B","C"],
data = [[1,2,"0X0"],
[1,2,"X00"],
[1,2,"0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 that is True where the (A, B) pair has not appeared before. Then combine Series.str.contains and Series.duplicated on column C to create a mask m2 that is True where C contains the string X and that value of C has not appeared before. Finally, filter the rows of df with these masks.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This is assuming you also want to drop duplicates in case there is no X in the 'C' column among the duplicates of columns A and B
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B'])['C'].transform('count')
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .str.endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1
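For comparison, the rules stated in the question can also be sketched with a per-group check instead of a count. This is only one reading of the requirements: the function name dedup_prefer_x and its parameters are made up for illustration, and it assumes X is matched as a literal substring anywhere in C:

```python
import pandas as pd

def dedup_prefer_x(df, keys=("A", "B"), col="C", marker="X"):
    """Hypothetical helper: dedupe on keys, preferring rows whose col contains marker."""
    keys = list(keys)
    # True for rows whose col contains the marker (literal match, no regex).
    has_x = df[col].str.contains(marker, regex=False)
    # True for every row whose key group has at least one marker row.
    group_has_x = has_x.groupby([df[k] for k in keys]).transform("any")
    # In groups with a marker row keep only marker rows; otherwise keep all rows.
    kept = df[has_x | ~group_has_x]
    # Rows identical on keys and col collapse to their first occurrence.
    return kept.drop_duplicates(subset=keys + [col])

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "00X"], [1, 3, "010"], [1, 2, "002"]])
df2 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "0X0"], [1, 2, "X00"], [1, 2, "0X0"]])
print(dedup_prefer_x(df1))
print(dedup_prefer_x(df2))
```

Everything here is vectorized or a groupby transform, so no Python-level loop over rows is involved. I have not timed it on 100 million rows.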

creating a new column from an existent categorical column in python

I have a data frame like this:
ID col
1 a
2 b
3 c
4 d
I want to create a new column so that if it is a or c, new column will give Y, otherwise N.
So, it will look like the following:
ID col col1
1 a Y
2 b N
3 c Y
4 d N
I am working in python3.
Try this code; it is simple and effective:
import pandas as pd
df = pd.DataFrame({'ID':[1,2,3,4],
'Col':['a', 'b', 'c', 'd']})
df['Col_2'] = df.apply(lambda row: 'Y' if (row.Col=='a' or row.Col=='c') else 'N' , axis = 1)
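If the frame is large, a vectorized version avoids calling a Python lambda for every row. A minimal sketch using Series.isin and numpy.where, with the same column names as the snippet above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Col": ["a", "b", "c", "d"]})
# isin builds a boolean mask; np.where maps True to 'Y' and False to 'N'.
df["Col_2"] = np.where(df["Col"].isin(["a", "c"]), "Y", "N")
print(df)
```

Adding another category to the Y group then only means extending the list passed to isin.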

pandas advanced splitting by comma

There have been a lot of posts concerning splitting a single column into multiples, but I couldn't find an answer to a slight modification to the idea of splitting.
When you use str.split, it splits the string independent of order. You can modify it to be slightly more complex, such as ordering it by sorting alphabetically
Example dataframe (df):
row
0 a, e, c, b
1 b, d, a
2 a, b, c, d, e
3 d, f
foo = df['row'].str.split(',')
will split based on the comma and return:
0 1 2 3
0 a e c b
....
However that doesn't align the results by their unique value. Even if you use a sort on the split string, it will still only result in this:
   0  1  2  3  4  5
0  a  b  c  e
1  a  b  d
...
whereas I want it to look like this:
   0  1  2  3  4  5
0  a  b  c     e
1  a  b     d
2  a  b  c  d  e
...
I know I'm missing something. Do I need to add the columns first and then map the split values to the correct column? What if you don't know all of the unique values? Still learning pandas syntax so any pointers in the right direction would be appreciated.
Using get_dummies
s=df.row.str.get_dummies(sep=', ')
s.mul(s.columns)
Out[239]:
   a  b  c  d  e  f
0  a  b  c     e
1  a  b     d
2  a  b  c  d  e
3           d     f
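A self-contained version of the same idea, for anyone who wants to run it end to end. This sketch assumes the values are separated by a comma followed by a space, and it fills the cells with an explicit mapping instead of multiplying by the column names:

```python
import pandas as pd

df = pd.DataFrame({"row": ["a, e, c, b", "b, d, a", "a, b, c, d, e", "d, f"]})

# One 0/1 indicator column per unique value, in sorted order.
dummies = df["row"].str.get_dummies(sep=", ")

# Replace each indicator with the column's own label, or '' where absent.
aligned = dummies.astype(bool).apply(
    lambda col: col.map({True: col.name, False: ""})
)
print(aligned)
```

The unique values do not need to be known in advance, since get_dummies collects them from the data.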

Excel Formula comparing two columns

Below is a sample of the data I have. I want to match the data in Column A and B. If column B is not matching column A, I want to add a row and copy the data from Column A to B. For example, "4" is missing in column B, so I want to add a space and add "4" to column B so it will match column A. I have a large set of data, so I am trying to find a different way instead of checking for duplicate values in the two columns and manually adding one row at a time. Thanks!
A B C D
3 3 Y B
4 5 G B
5 6 B G
6 8 P G
7 9 Y P
8 11 G Y
9 12 B Y
10
11
12
11
12
I would move columns B, C, and D to separate columns, say E, F, and G, then use INDEX/MATCH lookups of column A against the moved column B (now E) to identify which records are missing.
For col C: =IFERROR(INDEX(F:F,MATCH(A1,E:E,0)),"N/A")
For col D: =IFERROR(INDEX(G:G,MATCH(A1,E:E,0)),"N/A")
After this you can filter for C="N/A" to find the A values that have no matching B value, and edit those manually. Since you want A and B to match, column B is then unnecessary. The final result, after removing column B and shifting C to B and D to C:
A B C
3 Y B
4 N/A N/A
5 G B
6 B G
7 N/A N/A
Hope this helps!
