How to get specific column names from a pandas DataFrame that satisfy a condition - python-3.x

I have data as follows:

col1 col2 col3 col4 col5
   0    1    0    1    0
   1    1    0    0    1
   1    1    1    0    1

I want it as below:

col1 col2 col3 col4 col5  col6
   0    1    0    1    0  col2,col4
   1    1    0    0    1  col1,col2,col5
   1    1    1    0    1  col1,col2,col3,col5

Wherever the value is 1, the column name should be appended in col6. I tried idxmax(), however it's not working, maybe because there is more than one column that satisfies the condition. Can anyone please help?

You can do a matrix multiplication here:

df['col6'] = (df @ (df.columns + ',')).str[:-1]

Output:

   col1  col2  col3  col4  col5                 col6
0     0     1     0     1     0            col2,col4
1     1     1     0     0     1       col1,col2,col5
2     1     1     1     0     1  col1,col2,col3,col5
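A minimal runnable sketch of the trick, rebuilding the question's dataframe (the `col6` name is just the column asked for):

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [0, 1, 1], "col2": [1, 1, 1], "col3": [0, 0, 1],
     "col4": [1, 0, 0], "col5": [0, 1, 1]}
)

# df @ (df.columns + ',') multiplies each 0/1 row by the column-name strings:
# name * 0 -> '' and name * 1 -> the name itself, so the row-wise sum
# concatenates the names of the columns holding 1, each followed by ','.
df["col6"] = (df @ (df.columns + ",")).str[:-1]  # strip the trailing comma
print(df)
```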

Related

excel find the count of 2 filtered columns

There are paired columns that I am comparing (col1 and col2, col3 and col4), each holding either a blank, '0', or '1'. I basically want to know how many intersect.
id col1 col2 col3 col4
id1 0 1
id2 1 1 0
id3 0 1 1
id4
id5 0
For this table I want a count of how many ids have a 0 or 1 (between col1 and col2). If I use COUNTA(B2:C4) I get 4, but I need to get 3, as only 3 ids are affected for each pair. Is there a formula that would actually give 3 for col1 and col2 and 3 for col3 and col4?
SUMPRODUCT(--(B$2:B$7+C$2:C$7=0))
fails here and provides 3 instead of 5.

How to evaluate multiple columns in pandas?

I have the following pandas dataframe:

col1 col2 col3 .... colN
   5    2    4 ....    9
   1    2    3 ....    9
   7    1    4 ....    0
   1    4    7 ....    8

What I need is a way to determine the ordering between several columns:

col1 col2 col3 .... colN
   5    2    4 ....    9  ----> colN >= ... >= col5 >= col2 >= col3
   1    2    3 ....    9  ----> colN >= ... >= col3 >= col2 >= col1
   7    1    4 ....    0  ----> col1 >= ... >= col3 >= col2 >= colN
   1    4    7 ....    8  ----> colN >= ... >= col3 >= col2 >= col1

And give them a numeric alias. For example:

colN >= ... >= col5 >= col2 >= col3 = X
colN >= ... >= col3 >= col2 >= col1 = Y
col1 >= ... >= col3 >= col2 >= colN = Z
:
:

col1 col2 col3 .... colN  order
   5    2    4 ....    9  X
   1    2    3 ....    9  Y
   7    1    4 ....    0  Z
   1    4    7 ....    8  Y
:
:

The number of columns may change, and the alias has to follow a pattern. Example with 3 columns:

col1 >= col2 >= col3 = 1
col1 >= col3 >= col2 = 2
col2 >= col1 >= col3 = 3
col2 >= col3 >= col1 = 4
col3 >= col1 >= col2 = 5
col3 >= col2 >= col1 = 6

Thanks and regards
You can use:
df['order'] = df.apply(lambda x: '>='.join(x.sort_values(ascending=False).index), axis=1)
df['alias'] = df.groupby('order').ngroup() + 1
Input:

   col1  col2  col3
0     5     2     4
1     1     2     3
2     7     1     4
3     1     4     7

Output:

   col1  col2  col3             order  alias
0     5     2     4  col1>=col3>=col2      1
1     1     2     3  col3>=col2>=col1      2
2     7     1     4  col1>=col3>=col2      1
3     1     4     7  col3>=col2>=col1      2
Or, for a specific pattern:

alias_pattern = {'col1>=col3>=col2': 2, 'col3>=col2>=col1': 5}
df['alias'] = df['order'].map(alias_pattern)

Output:

   col1  col2  col3             order  alias
0     5     2     4  col1>=col3>=col2      2
1     1     2     3  col3>=col2>=col1      5
2     7     1     4  col1>=col3>=col2      2
3     1     4     7  col3>=col2>=col1      5
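The `ngroup` aliases above depend on which orderings happen to appear in the data. If the alias must follow the fixed 1..n! pattern from the question, one option (a sketch, not part of the original answer) is to enumerate all permutations of the columns up front:

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({"col1": [5, 1, 7, 1], "col2": [2, 2, 1, 4], "col3": [4, 3, 4, 7]})
cols = list(df.columns)

# Row-wise ordering string, as in the answer above: sort each row's values
# descending and join the corresponding column names.
df["order"] = df.apply(
    lambda x: ">=".join(x.sort_values(ascending=False).index), axis=1
)

# Number every possible ordering 1..n! in lexicographic column order,
# so the alias is stable even for orderings absent from the data.
alias_pattern = {">=".join(p): i + 1 for i, p in enumerate(permutations(cols))}
df["alias"] = df["order"].map(alias_pattern)
print(df)
```

With three columns this reproduces the question's 3-column table exactly: `col1>=col2>=col3` maps to 1, `col3>=col2>=col1` maps to 6, and so on.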

Pandas: Create different dataframes from a unique MultiIndex dataframe

I would like to know how to pass from a MultiIndex dataframe like this:

      A          B
col1 col2  col1 col2
   1    2    12   21
   3    1     2    0

to two separate dfs. df_A:

col1 col2
   1    2
   3    1

df_B:

col1 col2
  12   21
   2    0

Thank you for the help
I think it is better to use DataFrame.xs here for selecting by the first level:

print (df.xs('A', axis=1, level=0))
   col1  col2
0     1     2
1     3     1

What you need is not recommended, but it is possible to create DataFrames by groups:

for i, g in df.groupby(level=0, axis=1):
    globals()['df_' + str(i)] = g.droplevel(level=0, axis=1)

print (df_A)
   col1  col2
0     1     2
1     3     1

Better is to create a dictionary of DataFrames:

d = {i: g.droplevel(level=0, axis=1) for i, g in df.groupby(level=0, axis=1)}

print (d['A'])
   col1  col2
0     1     2
1     3     1
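A self-contained sketch that rebuilds the example MultiIndex frame and splits it with `xs`:

```python
import pandas as pd

# Two-level columns: 'A'/'B' on the first level, 'col1'/'col2' on the second.
columns = pd.MultiIndex.from_product([["A", "B"], ["col1", "col2"]])
df = pd.DataFrame([[1, 2, 12, 21], [3, 1, 2, 0]], columns=columns)

# xs picks one first-level label and drops that level from the columns.
df_A = df.xs("A", axis=1, level=0)
df_B = df.xs("B", axis=1, level=0)
print(df_A)
print(df_B)
```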

How to merge tab separated data (always starting with letters) into one string?

I have the following data in a file:
col1 col2 col3 col4 col5 col6
ABC DEF GE-10 0 0 12 4 16 0
HIJ KLM 7 0 123 40 0 0
NOP QL 17 0 0 6 10 1
I want to merge all the text information into one string (with '_' between) so that it looks like this:
col1 col2 col3 col4 col5 col6
ABC_DEF_GE-10 0 0 12 4 16 0
HIJ_KLM 7 0 123 40 0 0
NOP_QL 17 0 0 6 10 1
The issue is that the text information to be merged is in columns 1-2 for some rows and in columns 1-3 for others.
How can this be accomplished in Bash?
test.sh

#!/bin/bash
file='read_file.txt'
{
    # print the header line unchanged
    read -r header
    echo "$header"
    # read each remaining line
    while read -r line; do
        wordString=""
        # read each word
        for word in $line; do
            if [[ $word =~ ^[0-9]+$ ]]; then
                # purely numeric value: keep it as a separate field
                wordString="${wordString} ${word}"
            else
                # not purely numeric: merge it into the text part with '_'
                wordString="${wordString}_${word}"
            fi
        done
        # remove the leading separator and print the line
        echo "${wordString#?}"
    done
} < "$file"

Put the input in the file below, in the same directory:
read_file.txt
col1 col2 col3 col4 col5 col6
ABC DEF GE-10 0 0 12 4 16 0
HIJ KLM 7 0 123 40 0 0
NOP QL 17 0 0 6 10 1
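For comparison, the same merge can be sketched in Python (an alternative, not part of the original answer; `merge_text_fields` is a made-up helper name): treat every field before the first purely numeric one as text and join those fields with underscores.

```python
import re

def merge_text_fields(line):
    """Join the leading non-numeric fields with '_'; keep the rest as-is."""
    fields = line.split()
    text, rest = [], []
    for f in fields:
        if not rest and not re.fullmatch(r"\d+", f):
            text.append(f)   # still inside the leading text block
        else:
            rest.append(f)   # first numeric field reached; numbers from here on
    return " ".join((["_".join(text)] if text else []) + rest)

for row in ["ABC DEF GE-10 0 0 12 4 16 0",
            "HIJ KLM 7 0 123 40 0 0",
            "NOP QL 17 0 0 6 10 1"]:
    print(merge_text_fields(row))
```

Note that "GE-10" contains a hyphen, so it fails the `\d+` test and is correctly treated as text, which is exactly the case that makes a fixed column count unusable.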

Pandas dataframe: drop rows which contain a certain number of zeros

Hello, I have a dataframe of [13171 rows x 511 columns]. What I want is to remove the rows which have a certain number of zeros in them.
For example:

col0 col1 col2 col3 col4 col5
ID1     0    2    0    2    0
ID2     1    1    2   10    1
ID3     0    1    3    4    0
ID4     0    0    1    0    3
ID5     0    0    0    0    1

The ID5 row contains 4 zeros, so I want to drop that row. Like this, my large dataframe has rows with more than 100-300 zeros.
I tried the code below:

df=df[(df == 0).sum(1) >= 4]

For a small dataset like the example above the code is working, but for [13171 rows x 511 columns] it is not working (df=df[(df == 0).sum(1) >= 15]). Can anyone suggest how I can get the proper result?

Output:
col0 col1 col2 col3 col4 col5
ID1     0    2    0    2    0
ID2     1    1    2   10    1
ID3     0    1    3    4    0
ID4     0    0    1    0    3
This will work:

drop_indexs = []
for i in range(len(df.iloc[:, 0])):
    if (df.iloc[i, :] == 0).sum() >= 4:  # 4 is the minimum number of zeros a row must have to be dropped
        drop_indexs.append(i)
# drop_indexs holds positions, so convert them to index labels before dropping
updated_df = df.drop(df.index[drop_indexs])
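A vectorized sketch of the same filter (a boolean mask instead of the Python loop), assuming rows with 4 or more zeros should be dropped:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [0, 1, 0, 0, 0], "col2": [2, 1, 1, 0, 0],
     "col3": [0, 2, 3, 1, 0], "col4": [2, 10, 4, 0, 0],
     "col5": [0, 1, 0, 3, 1]},
    index=["ID1", "ID2", "ID3", "ID4", "ID5"],
)

# Count zeros per row, then KEEP the rows with fewer than 4 zeros.
# (Note the direction: `>= 4` selects the rows to drop, which is why the
# question's one-liner kept the wrong rows.)
updated_df = df[(df == 0).sum(axis=1) < 4]
print(updated_df)
```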
