How to merge tab-separated data (always starting with letters) into one string?

I have the following data in a file:
col1 col2 col3 col4 col5 col6
ABC DEF GE-10 0 0 12 4 16 0
HIJ KLM 7 0 123 40 0 0
NOP QL 17 0 0 6 10 1
I want to merge all the text fields into one string (joined with _) so that it looks like this:
col1 col2 col3 col4 col5 col6
ABC_DEF_GE-10 0 0 12 4 16 0
HIJ_KLM 7 0 123 40 0 0
NOP_QL 17 0 0 6 10 1
The issue is that the text to be merged spans columns 1-2 in some rows and columns 1-3 in others.
How can this be accomplished in Bash?

test.sh
#!/bin/bash
file='read_file.txt'
# Read the file line by line
while read -r line; do
    # Rebuild each line word by word
    wordString=""
    for word in $line; do
        if [[ $word =~ ^[0-9]+$ ]]; then
            # purely numeric value: keep it separated by a space
            wordString="${wordString} ${word}"
        else
            # contains non-digits (e.g. ABC or GE-10): join with an underscore
            wordString="${wordString}_${word}"
        fi
    done
    # remove the leading separator and print the line
    echo "${wordString#?}"
done < "$file"
Put the following data in read_file.txt in the same directory. (Note that the header's column names are non-numeric too, so the script joins them with underscores as well; print the header separately if it must stay unchanged.)
read_file.txt
col1 col2 col3 col4 col5 col6
ABC DEF GE-10 0 0 12 4 16 0
HIJ KLM 7 0 123 40 0 0
NOP QL 17 0 0 6 10 1
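For comparison, here is a rough Python sketch of the same logic (not part of the original answer, and it assumes whitespace-separated fields as in the sample): purely numeric tokens keep their space separator, everything else is joined with an underscore.

import re

# mirror of the shell loop: numeric-only tokens get a space separator,
# anything containing non-digits (ABC, GE-10, ...) gets an underscore
with open('read_file.txt') as fh:
    for line in fh:
        merged = ''
        for word in line.split():
            sep = ' ' if re.fullmatch(r'[0-9]+', word) else '_'
            merged += sep + word
        print(merged[1:])  # drop the leading separator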

Related

How to add a column in a pandas dataframe based on other columns for a large dataset?

I have a CSV file that contains 1,000,000 rows, with columns like the following.
col1 col2 col3...col20
0 1 0 ... 10
0 1 0 ... 20
1 0 0 ... 30
0 1 0 ... 40
0 0 1 ... 50
................
I want to add a new column called col1_col2_col3, like the following.
col1 col2 col3 col1_col2_col3 ...col20
0 1 0 2 ... 10
0 1 0 2 ... 20
1 0 0 1 ... 30
0 1 0 2 ... 40
0 0 1 3 ... 50
.................
I have loaded the data file into a pandas DataFrame, then tried the following.
for idx, row in df.iterrows():
    if df.loc[idx, 'col1'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 1
    elif df.loc[idx, 'col2'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 2
    elif df.loc[idx, 'col3'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 3
The above solution works. However, my code takes a very long time to run. Is there any way to create col1_col2_col3 faster?
Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:
df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)
N.B. It assumes that each row has exactly one nonzero value across ['col1', 'col2', 'col3'].
Output:
col1 col2 col3 ... col20 col1_col2_col3
0 0 1 0 ... 10 2
1 0 1 0 ... 20 2
2 1 0 0 ... 30 1
3 0 1 0 ... 40 2
4 0 0 1 ... 50 3
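For reference, a self-contained sketch of that one-liner (assuming a small frame shaped like the sample, with col4 through col19 omitted):

import pandas as pd

df = pd.DataFrame({'col1': [0, 0, 1, 0, 0],
                   'col2': [1, 1, 0, 1, 0],
                   'col3': [0, 0, 0, 0, 1],
                   'col20': [10, 20, 30, 40, 50]})

# weight the indicator columns by 1, 2, 3, hide the zeros as NaN, then
# back-fill across the row so the first column carries the surviving weight
weighted = df[['col1', 'col2', 'col3']].mul([1, 2, 3]).mask(lambda x: x == 0)
df['col1_col2_col3'] = weighted.bfill(axis=1)['col1'].astype(int)
print(df)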
You can use NumPy's argmax:
df.assign(
    col1_col2_col3=df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1) + 1
)
col1 col2 col3 col20 col1_col2_col3
0 0 1 0 10 2
1 0 1 0 20 2
2 1 0 0 30 1
3 0 1 0 40 2
4 0 0 1 50 3
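One caveat: argmax returns 0 for an all-zero row as well, so such a row would be labeled 1. If those rows can occur, a guard along these lines helps (a sketch; the 0 marker for "no flag set" is a hypothetical choice):

import numpy as np

sub = df[['col1', 'col2', 'col3']].to_numpy()
labels = sub.argmax(axis=1) + 1
# argmax of an all-zero row is also 0, so mark those rows explicitly
df['col1_col2_col3'] = np.where(sub.sum(axis=1) == 0, 0, labels)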

Pandas dataframe: drop rows which contain a certain number of zeros

Hello, I have a dataframe of [13171 rows x 511 columns], and I want to remove the rows that contain a certain number of zeros,
for example
col0 col1 col2 col3 col4 col5
ID1 0 2 0 2 0
ID2 1 1 2 10 1
ID3 0 1 3 4 0
ID4 0 0 1 0 3
ID5 0 0 0 0 1
The ID5 row contains 4 zeros, so I want to drop that row. Similarly, my large dataframe has rows containing 100-300 zeros.
I tried below code
df=df[(df == 0).sum(1) >= 4]
For a small dataset like the example above the code works, but for [13171 rows x 511 columns] it does not (df=df[(df == 0).sum(1) >= 15]). Can anyone suggest how I can get the proper result?
output
col0 col1 col2 col3 col4 col5
ID1 0 2 0 2 0
ID2 1 1 2 10 1
ID3 0 1 3 4 0
ID4 0 0 1 0 3
This will work:
drop_indexs = []
for i in range(len(df.iloc[:, 0])):
    if (df.iloc[i, :] == 0).sum() >= 4:  # 4 is the minimum number of zeros a row must have to be dropped
        drop_indexs.append(i)
updated_df = df.drop(drop_indexs)  # assumes a default RangeIndex, so positions double as labels
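The loop scales poorly at [13171 rows x 511 columns], though; a vectorized version of the same filter does it in one pass. Note that the asker's mask with >= selects the rows to drop, not the rows to keep, so the comparison must be flipped:

# count zeros per row and keep only rows below the threshold;
# raise the threshold (e.g. to 15) for the real 511-column frame
updated_df = df[(df == 0).sum(axis=1) < 4]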

line feed inside row in column with pandas

Is there any way in pandas to separate data inside a row in a column? Each row holds multiple values. I mean, I group by col1 and the result is that I have a df like this:
col1 Col2
0 1 abc,def,ghi
1 2 xyz,asd
and desired output would be:
Col1 Col2
0 1 abc
def
ghi
1 2 xyz
asd
thanks
Use str.split and explode:
print(df.assign(Col2=df["Col2"].str.split(","))
        .explode("Col2"))
col1 Col2
0 1 abc
0 1 def
0 1 ghi
1 2 xyz
1 2 asd
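Note that explode keeps the original index (hence the repeated 0 and 1 above) and requires pandas 0.25 or newer. If a fresh index is preferred, add a reset_index:

out = (df.assign(Col2=df["Col2"].str.split(","))
         .explode("Col2")
         .reset_index(drop=True))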

how to get the specific column names from a pandas DF that satisfy a condition

I have data as follow:
col1 col2 col3 col4 col5
0 1 0 1 0
1 1 0 0 1
1 1 1 0 1
I want it as below:
col1 col2 col3 col4 col5 col6
0 1 0 1 0 col2,col4
1 1 0 0 1 col2,col4,col5
1 1 1 0 1 col1,col2,col3,col5
Wherever the value is 1, the column name should be appended to col6. I tried idxmax(), however it's not working, maybe because more than one column satisfies the condition. Can anyone please help?
You can do a matrix multiplication here:
df['col6'] = (df @ (df.columns + ',')).str[:-1]
Output:
col1 col2 col3 col4 col5 col6
0 0 1 0 1 0 col2,col4
1 1 1 0 0 1 col1,col2,col5
2 1 1 1 0 1 col1,col2,col3,col5
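The trick works because multiplying a Python string by 0 or 1 yields '' or the string itself, so the dot product concatenates exactly the names of the columns holding a 1. A self-contained sketch using the sample data:

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 1],
                   'col2': [1, 1, 1],
                   'col3': [0, 0, 1],
                   'col4': [1, 0, 0],
                   'col5': [0, 1, 1]})

# 0 * 'col1,' is '' and 1 * 'col1,' is 'col1,'; the dot product sums
# (concatenates) the survivors, and .str[:-1] trims the trailing comma
df['col6'] = (df @ (df.columns + ',')).str[:-1]
print(df)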

pandas fill column with random numbers with a total for each row

I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on the unique values in the foo column: col1, col2, col3, col4.
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
if foo == col1, then col1 contains a random number between 75 and 100, and the other columns (col2, col3, col4) contain random numbers such that the total for each row is 100
I can manually create a new column and assign a random number, but I'm unsure how to enforce that each row sums to 100.
Appreciate any help!
My two cents
d = []
s = np.random.randint(75, 100, size=6)          # one large draw per row
for x in 100 - s:
    a = np.random.randint(100, size=3)
    b = np.random.multinomial(x, a / a.sum())   # split the remainder into 3 parts
    d.append(b.tolist())
# permute each row, so the large draw lands in a random column
s = [np.random.choice(x, 4, replace=False) for x in np.column_stack((s, np.array(d)))]
df = pd.concat([df, pd.DataFrame(s, index=df.index)], axis=1)
df
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87
IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75, 100), np.random.randint(0, 100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0, 100 - x['col1'], 1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0, 100 - x[['col1', 'col2']].sum(), 1)[0], axis=1)
df['col4'] = 100 - df[['col1', 'col2', 'col3']].sum(1).astype(int)
df[['col1', 'col2', 'col3', 'col4']].sum(1)  # sanity check: every row sums to 100
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4
My Approach
import numpy as np

def weird(lower, upper, k, col, cols):
    # draw the large number for `col`, then spread the remainder over the other k-1 columns
    first_num = np.random.randint(lower, upper)
    delta = upper - first_num
    the_rest = np.random.rand(k - 1)
    the_rest = the_rest / the_rest.sum() * delta
    the_rest = the_rest.astype(int)
    the_rest[-1] = delta - the_rest[:-1].sum()
    # sorting with this key puts `col` first, so it receives first_num
    key = lambda x: x != col
    return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))

def f(c):
    return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])

df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8
If we subtract 75 from the numbers between 75 and 100, the problem becomes generating a table of random numbers between 0 and 25 where each row sums to 25. That can be solved with a reverse cumsum:
num_cols = 4
# generate num_cols-1 sorted cut points in each row
a = np.sort(np.random.randint(0, 25, (len(df), num_cols - 1)), axis=1)
# create a dataframe and attach a last column with value 25
new_df = pd.DataFrame(a)
new_df[num_cols - 1] = 25
# the consecutive differences are our numbers (filling the first column,
# which diff leaves as NaN, with the first cut points); add them to the dummies
dummies = pd.get_dummies(df.foo) * 75
dummies += new_df.diff(axis=1).fillna(new_df[[0]]).values
And dummies is
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 1.0 79.0 2.0 4.0
2 76.0 5.0 8.0 9.0
3 1.0 3.0 79.0 10.0
4 1.0 2.0 1.0 88.0
5 1.0 82.0 1.0 7.0
which can be concatenated to the original dataframe.
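That final step might look like this (assuming df is the original frame from the question):

# attach the generated columns; cast to int to drop the float artifacts
df = pd.concat([df, dummies.astype(int)], axis=1)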
