Pandas dataframe drop rows which store certain number of zeros in it - python-3.x

Hello I have dataframe which is having [13171 rows x 511 columns] what I wanted is remove the rows which is having certain number of zeros
for example
col0 col1 col2 col3 col4 col5
ID1 0 2 0 2 0
ID2 1 1 2 10 1
ID3 0 1 3 4 0
ID4 0 0 1 0 3
ID5 0 0 0 0 1
in ID5 row contains 4 zeros in it so I wanted to drop that row. like this I have large dataframe which is having more than 100-300 zeros in rows
I tried below code
df=df[(df == 0).sum(1) >= 4]
for small dataset like above example code is working but for [13171 rows x 511 columns] not working(df=df[(df == 0).sum(1) >= 15]) any one suggest me how can I get proper result
output
col0 col1 col2 col3 col4 col5
ID1 0 2 0 2 0
ID2 1 1 2 10 1
ID3 0 1 3 4 0
ID4 0 0 1 0 3

This will work:
drop_indexs = []
for i in range(len(df.iloc[:,0])):
if (df.iloc[i,:]==0).sum()>=4: # 4 is how many zeros should row min have
drop_indexs.append(i)
updated_df = df.drop(drop_indexs)

Related

excel find the count of 2 filtered columns

There are paired columns that I am comparing(col1 and col2, col3 and col4) with either blank or '0' or '1'. I basically want to know how many are intersect
id col1 col2 col3 col4
id1 0 1
id2 1 1 0
id3 0 1 1
id4
id5 0
for this table I want to count of how many ids are 0 or 1(between col1 and col2). If I use countA(b2:c4) I get 4 but I need to get 3 as only 3 ids are affected for each pair
. Is therea formula that would actually give 3 for col1 and col2 and 3 for col3 and col4.
SUMPRODUCT(--(B$2:B$7+C$2:C$7=0))
fails here and provides 3 instead of 5

Pandas: Create different dataframes from an unique multiIndex dataframe

I would like to know how to pass from a multiindex dataframe like this:
A B
col1 col2 col1 col2
1 2 12 21
3 1 2 0
To two separated dfs. df_A:
col1 col2
1 2
3 1
df_B:
col1 col2
12 21
2 0
Thank you for the help
I think here is better use DataFrame.xs for selecting by first level:
print (df.xs('A', axis=1, level=0))
col1 col2
0 1 2
1 3 1
What need is not recommended, but possible create DataFrames by groups:
for i, g in df.groupby(level=0, axis=1):
globals()['df_' + str(i)] = g.droplevel(level=0, axis=1)
print (df_A)
col1 col2
0 1 2
1 3 1
Better is create dictionary of DataFrames:
d = {i:g.droplevel(level=0, axis=1)for i, g in df.groupby(level=0, axis=1)}
print (d['A'])
col1 col2
0 1 2
1 3 1

Add rows to dataframe using the string values from column

I want to add rows to a dataframe based on a columns values for each row so a string value of (1:2:3) will create a new column and add rows for that column as described in the example below:
I have this kind of data:
Col1 | Col2
1 | 1:2:3
2 | 4:5
I want to transform it to look like this:
Col1 | Col2
1 | 1
1 | 2
1 | 3
2 | 4
2 | 5
I know that this can be done using nested for loops, but I'm sure there's a better way to do it.
Do split and explode
df=df.assign(Col2=df.Col2.str.split(':')).explode('Col2')
Out[161]:
Col1 Col2
0 1 1
0 1 2
0 1 3
1 2 4
1 2 5
df = pd.DataFrame({'Col1':[1,2],'Col2':['1:2:3','4:5']})
Split the values in Col2 so they are lists and explode.
>>> df['Col2'] = df.apply(lambda x: x['Col2'].split(':'), axis = 1)
>>> df.explode('Col2')
Col1 Col2
0 1 1
0 1 2
0 1 3
1 2 4
1 2 5

how to get specific column names from a pandas DF that satisfies a condition

I have data as follow:
col1 col2 col3 col4 col5
0 1 0 1 0
1 1 0 0 1
1 1 1 0 1
I want it as below:-
col1 col2 col3 col4 col5 col6
0 1 0 1 0 col2,col4
1 1 0 0 1 col2,col4,col5
1 1 1 0 1 col1,col2,col3,col5
Whereever the value is 1, the column name should be appended in col 6. I tried idx.max(), however its not working may be because there are more than one column which satisfies the condition. Can anyone please help?
You can do a matrix multiplication here:
(df # (df.columns + ',')).str[:-1]
Output:
col1 col2 col3 col4 col5 col6
0 0 1 0 1 0 col2,col4
1 1 1 0 0 1 col1,col2,col5
2 1 1 1 0 1 col1,col2,col3,col5

pandas fill column with random numbers with a total for each row

I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on unique values in foo column. col1,col2, col3, col4
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
if foo = col1 then col1 contains a random number between 75-100 and the other columns (col2, col3, col4) contains random numbers, such that the total for each row is 100
I can manually create a new column and assign a random number, but I'm unsure how to include the logic of sum for each row of 100.
Appreciate any help!
My two cents
d=[]
s=np.random.randint(75,100,size=6)
for x in 100-s:
a=np.random.randint(100, size=3)
b=np.random.multinomial(x, a /a.sum())
d.append(b.tolist())
s=[np.random.choice(x,4,replace= False) for x in np.column_stack((s,np.array(d))) ]
df=pd.concat([df,pd.DataFrame(s,index=df.index)],1)
df
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87
IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75,100), np.random.randint(0,100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0,100-x['col1'],1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0,100-x[['col1','col2']].sum(),1)[0], axis=1)
df['col4'] = 100 - df[['col1','col2','col3']].sum(1).astype(int)
df[['col1','col2','col3','col4']].sum(1)
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4
My Approach
import numpy as np
def weird(lower, upper, k, col, cols):
first_num = np.random.randint(lower, upper)
delta = upper - first_num
the_rest = np.random.rand(k - 1)
the_rest = the_rest / the_rest.sum() * (delta)
the_rest = the_rest.astype(int)
the_rest[-1] = delta - the_rest[:-1].sum()
key = lambda x: x != col
return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))
def f(c): return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])
df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8
If we subtract the numbers between 75-100 by 75, the problem become generating a table of random number between 0-25 whose each row sums to 25. That can be solve by reverse cumsum:
num_cols = 4
# generate random number and sort them in each row
a = np.sort(np.random.randint(0,25, (len(df), num_cols)), axis=1)
# create a dataframe and attach a last column with values 25
new_df = pd.DataFrame(a)
new_df[num_cols] = 25
# compute the difference, which are our numbers and add to the dummies:
dummies = pd.get_dummies(df.foo) * 75
dummies += new_df.diff(axis=1).fillna(new_df[0]).values
And dummies is
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 1.0 79.0 2.0 4.0
2 76.0 5.0 8.0 9.0
3 1.0 3.0 79.0 10.0
4 1.0 2.0 1.0 88.0
5 1.0 82.0 1.0 7.0
which can be concatenated to the original dataframe.

Resources