How to merge tab-separated data (always starting with letters) into one string?

I have the following data in a file:
col1 col2 col3 col4 col5 col6
ABC DEF GE-10 0 0 12 4 16 0
HIJ KLM 7 0 123 40 0 0
NOP QL 17 0 0 6 10 1
I want to merge all the text fields into one string (joined with _) so that it looks like this:
col1 col2 col3 col4 col5 col6
ABC_DEF_GE-10 0 0 12 4 16 0
HIJ_KLM 7 0 123 40 0 0
NOP_QL 17 0 0 6 10 1
The issue is that the text to be merged spans columns 1-2 in some rows and columns 1-3 in others.
How can this be accomplished in Bash?

test.sh
#!/bin/bash
file='read_file.txt'
# Read the file line by line
while read -r line; do
    # Rebuild each line word by word
    wordString=""
    for word in $line; do
        if [[ $word =~ ^[0-9]+$ ]]; then
            # purely numeric value: keep it separated by a space
            wordString="${wordString} ${word}"
        else
            # contains non-digits (e.g. ABC or GE-10): join with an underscore
            wordString="${wordString}_${word}"
        fi
    done
    # remove the leading separator and print the line
    echo "${wordString#?}"
done < "$file"
Put the following data in read_file.txt in the same directory. (Note that the header's column names are non-numeric too, so the script joins them with underscores as well; print the header separately if it must stay unchanged.)
read_file.txt
col1 col2 col3 col4 col5 col6
ABC DEF GE-10 0 0 12 4 16 0
HIJ KLM 7 0 123 40 0 0
NOP QL 17 0 0 6 10 1
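For comparison, here is a rough Python sketch of the same logic (not part of the original answer, and it assumes whitespace-separated fields as in the sample): purely numeric tokens keep their space separator, everything else is joined with an underscore.

import re

# mirror of the shell loop: numeric-only tokens get a space separator,
# anything containing non-digits (ABC, GE-10, ...) gets an underscore
with open('read_file.txt') as fh:
    for line in fh:
        merged = ''
        for word in line.split():
            sep = ' ' if re.fullmatch(r'[0-9]+', word) else '_'
            merged += sep + word
        print(merged[1:])  # drop the leading separator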

Related

How to add a column in a pandas dataframe based on other columns for a large dataset?

I have a CSV file that contains 1,000,000 rows, with columns like the following.
col1 col2 col3...col20
0 1 0 ... 10
0 1 0 ... 20
1 0 0 ... 30
0 1 0 ... 40
0 0 1 ... 50
................
I want to add a new column called col1_col2_col3, like the following.
col1 col2 col3 col1_col2_col3 ...col20
0 1 0 2 ... 10
0 1 0 2 ... 20
1 0 0 1 ... 30
0 1 0 2 ... 40
0 0 1 3 ... 50
.................
I have loaded the data file into a pandas DataFrame, then tried the following.
for idx, row in df.iterrows():
    if df.loc[idx, 'col1'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 1
    elif df.loc[idx, 'col2'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 2
    elif df.loc[idx, 'col3'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 3
The above solution works. However, my code takes a very long time to run. Is there any way to create col1_col2_col3 faster?
Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:
df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)
N.B. It assumes that each row has exactly one nonzero value across ['col1', 'col2', 'col3'].
Output:
col1 col2 col3 ... col20 col1_col2_col3
0 0 1 0 ... 10 2
1 0 1 0 ... 20 2
2 1 0 0 ... 30 1
3 0 1 0 ... 40 2
4 0 0 1 ... 50 3
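For reference, a self-contained sketch of that one-liner (assuming a small frame shaped like the sample, with col4 through col19 omitted):

import pandas as pd

df = pd.DataFrame({'col1': [0, 0, 1, 0, 0],
                   'col2': [1, 1, 0, 1, 0],
                   'col3': [0, 0, 0, 0, 1],
                   'col20': [10, 20, 30, 40, 50]})

# weight the indicator columns by 1, 2, 3, hide the zeros as NaN, then
# back-fill across the row so the first column carries the surviving weight
weighted = df[['col1', 'col2', 'col3']].mul([1, 2, 3]).mask(lambda x: x == 0)
df['col1_col2_col3'] = weighted.bfill(axis=1)['col1'].astype(int)
print(df)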
You can use NumPy's argmax:
df.assign(
    col1_col2_col3=df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1) + 1
)
col1 col2 col3 col20 col1_col2_col3
0 0 1 0 10 2
1 0 1 0 20 2
2 1 0 0 30 1
3 0 1 0 40 2
4 0 0 1 50 3
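One caveat: argmax returns 0 for an all-zero row as well, so such a row would be labeled 1. If those rows can occur, a guard along these lines helps (a sketch; the 0 marker for "no flag set" is a hypothetical choice):

import numpy as np

sub = df[['col1', 'col2', 'col3']].to_numpy()
labels = sub.argmax(axis=1) + 1
# argmax of an all-zero row is also 0, so mark those rows explicitly
df['col1_col2_col3'] = np.where(sub.sum(axis=1) == 0, 0, labels)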

Pandas dataframe: drop rows which contain a certain number of zeros

Hello, I have a dataframe of [13171 rows x 511 columns], and I want to remove the rows that contain a certain number of zeros,
for example
col0 col1 col2 col3 col4 col5
ID1 0 2 0 2 0
ID2 1 1 2 10 1
ID3 0 1 3 4 0
ID4 0 0 1 0 3
ID5 0 0 0 0 1
The ID5 row contains 4 zeros, so I want to drop that row. Similarly, my large dataframe has rows containing 100-300 zeros.
I tried below code
df=df[(df == 0).sum(1) >= 4]
For a small dataset like the example above the code works, but for [13171 rows x 511 columns] it does not (df=df[(df == 0).sum(1) >= 15]). Can anyone suggest how I can get the proper result?
output
col0 col1 col2 col3 col4 col5
ID1 0 2 0 2 0
ID2 1 1 2 10 1
ID3 0 1 3 4 0
ID4 0 0 1 0 3
This will work:
drop_indexs = []
for i in range(len(df.iloc[:, 0])):
    if (df.iloc[i, :] == 0).sum() >= 4:  # 4 is the minimum number of zeros a row must have to be dropped
        drop_indexs.append(i)
updated_df = df.drop(drop_indexs)  # assumes a default RangeIndex, so positions double as labels
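The loop scales poorly at [13171 rows x 511 columns], though; a vectorized version of the same filter does it in one pass. Note that the asker's mask with >= selects the rows to drop, not the rows to keep, so the comparison must be flipped:

# count zeros per row and keep only rows below the threshold;
# raise the threshold (e.g. to 15) for the real 511-column frame
updated_df = df[(df == 0).sum(axis=1) < 4]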

line feed inside row in column with pandas

Is there any way in pandas to separate data inside a row in a column? Each row holds multiple values. I mean, I group by col1 and the result is that I have a df like this:
col1 Col2
0 1 abc,def,ghi
1 2 xyz,asd
and desired output would be:
Col1 Col2
0 1 abc
def
ghi
1 2 xyz
asd
thanks
Use str.split and explode:
print(df.assign(Col2=df["Col2"].str.split(","))
        .explode("Col2"))
col1 Col2
0 1 abc
0 1 def
0 1 ghi
1 2 xyz
1 2 asd
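Note that explode keeps the original index (hence the repeated 0 and 1 above) and requires pandas 0.25 or newer. If a fresh index is preferred, add a reset_index:

out = (df.assign(Col2=df["Col2"].str.split(","))
         .explode("Col2")
         .reset_index(drop=True))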

how to get the specific column names from a pandas DF that satisfy a condition

I have data as follow:
col1 col2 col3 col4 col5
0 1 0 1 0
1 1 0 0 1
1 1 1 0 1
I want it as below:
col1 col2 col3 col4 col5 col6
0 1 0 1 0 col2,col4
1 1 0 0 1 col2,col4,col5
1 1 1 0 1 col1,col2,col3,col5
Wherever the value is 1, the column name should be appended to col6. I tried idxmax(), however it's not working, maybe because more than one column satisfies the condition. Can anyone please help?
You can do a matrix multiplication here:
df['col6'] = (df @ (df.columns + ',')).str[:-1]
Output:
col1 col2 col3 col4 col5 col6
0 0 1 0 1 0 col2,col4
1 1 1 0 0 1 col1,col2,col5
2 1 1 1 0 1 col1,col2,col3,col5
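The trick works because multiplying a Python string by 0 or 1 yields '' or the string itself, so the dot product concatenates exactly the names of the columns holding a 1. A self-contained sketch using the sample data:

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 1],
                   'col2': [1, 1, 1],
                   'col3': [0, 0, 1],
                   'col4': [1, 0, 0],
                   'col5': [0, 1, 1]})

# 0 * 'col1,' is '' and 1 * 'col1,' is 'col1,'; the dot product sums
# (concatenates) the survivors, and .str[:-1] trims the trailing comma
df['col6'] = (df @ (df.columns + ',')).str[:-1]
print(df)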

pandas fill column with random numbers with a total for each row

I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on the unique values in the foo column: col1, col2, col3, col4.
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
if foo == col1, then col1 contains a random number between 75 and 100, and the other columns (col2, col3, col4) contain random numbers such that the total for each row is 100
I can manually create a new column and assign a random number, but I'm unsure how to enforce that each row sums to 100.
Appreciate any help!
My two cents
d = []
s = np.random.randint(75, 100, size=6)          # one large draw per row
for x in 100 - s:
    a = np.random.randint(100, size=3)
    b = np.random.multinomial(x, a / a.sum())   # split the remainder into 3 parts
    d.append(b.tolist())
# permute each row, so the large draw lands in a random column
s = [np.random.choice(x, 4, replace=False) for x in np.column_stack((s, np.array(d)))]
df = pd.concat([df, pd.DataFrame(s, index=df.index)], axis=1)
df
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87
IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75, 100), np.random.randint(0, 100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0, 100 - x['col1'], 1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0, 100 - x[['col1', 'col2']].sum(), 1)[0], axis=1)
df['col4'] = 100 - df[['col1', 'col2', 'col3']].sum(1).astype(int)
df[['col1', 'col2', 'col3', 'col4']].sum(1)  # sanity check: every row sums to 100
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4
My Approach
import numpy as np

def weird(lower, upper, k, col, cols):
    # draw the large number for `col`, then spread the remainder over the other k-1 columns
    first_num = np.random.randint(lower, upper)
    delta = upper - first_num
    the_rest = np.random.rand(k - 1)
    the_rest = the_rest / the_rest.sum() * delta
    the_rest = the_rest.astype(int)
    the_rest[-1] = delta - the_rest[:-1].sum()
    # sorting with this key puts `col` first, so it receives first_num
    key = lambda x: x != col
    return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))

def f(c):
    return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])

df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8
If we subtract 75 from the numbers between 75 and 100, the problem becomes generating a table of random numbers between 0 and 25 where each row sums to 25. That can be solved with a reverse cumsum:
num_cols = 4
# generate num_cols-1 sorted cut points in each row
a = np.sort(np.random.randint(0, 25, (len(df), num_cols - 1)), axis=1)
# create a dataframe and attach a last column with value 25
new_df = pd.DataFrame(a)
new_df[num_cols - 1] = 25
# the consecutive differences are our numbers (filling the first column,
# which diff leaves as NaN, with the first cut points); add them to the dummies
dummies = pd.get_dummies(df.foo) * 75
dummies += new_df.diff(axis=1).fillna(new_df[[0]]).values
And dummies is
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 1.0 79.0 2.0 4.0
2 76.0 5.0 8.0 9.0
3 1.0 3.0 79.0 10.0
4 1.0 2.0 1.0 88.0
5 1.0 82.0 1.0 7.0
which can be concatenated to the original dataframe.
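That final step might look like this (assuming df is the original frame from the question):

# attach the generated columns; cast to int to drop the float artifacts
df = pd.concat([df, dummies.astype(int)], axis=1)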
