Convert multiple rows into a single column - linux

I'm trying to "convert" the following file from multiple rows into a single column.
classr#94 mesur#237 high#228 cash#232
classr#118 mesur#332 high#430 cash#421 Sar#380
classr#57 mesur#89 hight#65
My desired output:
classr#94
mesur#237
high#228
cash#232
classr#118
mesur#332
high#430
cash#421
Sar#380
classr#57
mesur#89
hight#65
I tried
datamash -t: transpose < Filename but it converted my file in a very "weird" way.
I also tried grep -o # File_name but I got only the # characters.
I think that in the grep case, if I find a way to get the entire word, I will obtain the desired output.

cat filetoconvert | tr " " "\n"
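The grep idea from the question also gets you there if you match the whole word around each # rather than just the # itself; a sketch, using the same filetoconvert name as above:
grep -o '[^[:space:]]*#[^[:space:]]*' filetoconvert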


How to format gcloud compute instances list output to excel format

I tried various approaches; the one nearest to working:
Replace multiple spaces with a single one.
Replace commas (,) in the INTERNAL_IP column with a pipe (|).
Remove the 4th column (PREEMPTIBLE), as it was causing IPs in the INTERNAL_IP column to shift under it.
Replace spaces with commas (,) to prepare a CSV file.
But it did not work; it gets messed up at the PREEMPTIBLE column.
gcloud compute instances list > file1
tr -s " " < file1 > file2         # replace multiple spaces with a single one
sed 's/,/|/g' file2 > file3       # replace , with |
awk '{$4=""; print $0}' file3     # remove the 4th column
sed -e 's/\s\+/,/g' file3 > final.csv
(The output of the gcloud compute instances list command and the expected format are in the attached sample files below.)
Any help or suggestion is appreciated. Thank you in advance.
Edit:
Attached sample input and expected output files:
sample_input.txt
expected_output.xlsx
CSV format is supported by the gcloud CLI, so everything you are doing can be done without sed/awk (maybe with | tail -n +2 if you want to skip the column header):
gcloud compute instances list --format="csv(NAME,ZONE,MACHINE_TYPE,PREEMPTIBLE,INTERNAL_IP,EXTERNAL_IP,STATUS)" > final.csv
Or if you wanted to do something with the data in your bash script:
while IFS="," read -r NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
do
echo "NAME=$NAME ZONE=$ZONE MACHINE_TYPE=$MACHINE_TYPE PREEMPTIBLE=$PREEMPTIBLE INTERNAL_IP=$INTERNAL_IP EXTERNAL_IP=$EXTERNAL_IP STATUS=$STATUS"
done < <(gcloud compute instances list --format="csv(NAME,ZONE,MACHINE_TYPE,PREEMPTIBLE,INTERNAL_IP,EXTERNAL_IP,STATUS)" | tail -n +2)
Based on the attached sample input and expected output files, I made the following changes:
Some of the instances have multiple internal IPs, and they are separated by ",". I replaced that "," with "-" using sed 's/,/-/g' to avoid conflicts with other fields, since we are generating a CSV.
$4 and $5 are printed in the 5th and 7th columns so that they line up with the Internal IP Address and Status column headers.
cat command_output.txt | grep -v 'NAME' | sed 's/,/-/g' | awk ' BEGIN {print "NAME,ZONE,MACHINE_TYPE,PREMPTIBLE,INTERNAL_IP,EXTERNAL_IP,STATUS"} {print $1","$2","$3","" "","$4","" "","$5}'
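For illustration only (hypothetical instance name and IPs), here is roughly what the awk step does to one default-format line where PREEMPTIBLE and EXTERNAL_IP are empty, so awk sees five whitespace-separated fields:
# hypothetical input line:   my-vm  us-central1-a  n1-standard-1  10.128.0.2  RUNNING
# what the awk prints:       my-vm,us-central1-a,n1-standard-1, ,10.128.0.2, ,RUNNING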

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given in a separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
This code works well for the multiline sequences, but the IDs have to be inserted into the code one at a time.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
This code can take the IDs from the id.txt file, but it returns only the first line of each multiline sequence.
I guess the right thing would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
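For readers less familiar with awk, here is the same one-liner spelled out with comments (my reading of it; the behaviour is unchanged):
awk -F'>' '
    NR==FNR { ids[$0]; next }     # 1st file (id.txt): store each ID as an array key
    NF>1    { f = ($2 in ids) }   # header lines contain ">", so they have 2 fields; set the flag if the ID is wanted
    f                             # print every line (header and sequence) while the flag is set
' id.txt seq.fasta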
The following awk may also help you with the same:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~25 GB.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of each sequence title into an index file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
seqnm=$1                                                # sequence title to extract, e.g. DM_0000016115
idx0_idx1=`grep -n $seqnm multi-fasta.idx`              # e.g. "7507:42520:>DM_0000016115"
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`                # line number of the title within multi-fasta.idx
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`                # line number of the title within multi-fasta.fa
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`   # line number of the next title within multi-fasta.fa
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta       # keep only lines idx1..idx2-1
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we extract the lines from idx1 through idx2 minus 1, which are the title and the sequence. (If every sequence spanned a fixed number of lines, you could simply use grep -A instead.)
The advantage of this ugly hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is that this process is slow: for my 25 GB multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
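For comparison, the samtools faidx route mentioned above looks roughly like this:
samtools faidx multi-fasta.fa                                        # one-time: build the multi-fasta.fa.fai index
samtools faidx multi-fasta.fa DM_0000016115 > DM_0000016115.fasta    # extract one sequence by name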

how to count occurrence of specific word in group of file by bash/shellscript

I have two text files, simple.txt and simple1.txt, with the following data in them:
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" for each file, but I want to display the total count, which should be 7, i.e. the total over both files. Can I do this in a single command/shell script?
P.S.: I had to write two commands as tr does not take two file names.
Try this, the straightforward way:
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
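With the two sample files above, this should print the combined count:
$ cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
7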
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i (the +0 ensures that 0 is printed, rather than an empty string, if no word matched).
It is not the case that tr accepts only one filename; it does not accept any filenames at all (it always reads from stdin). That's why, even in your solution, you didn't provide a filename to tr but used input redirection instead.
In your case, I think you can replace tr by fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it more readable this way, but this is a matter of taste.)
Note that both solutions (mine and your original ones) would count a string consisting of letters and one or more non-space characters - for instance the string haaaa.hbbbbbb.hccccc - as a single block, i.e. they would only add 1 to the count of "h" words, not 3. Whether or not this is the desired behaviour is up to you to decide.
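If you did want each embedded "h" word counted separately (so the hypothetical haaaa.hbbbbbb.hccccc above would contribute 3), one possible sketch is to let grep -o emit every match on its own line and count them:
grep -o -i '\bh\w*' simple.txt simple1.txt | wc -l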

grep whole line if character on position 1 and strings found in line

Hi, I'm trying to combine multiple conditions to grep lines from a file.
E.g.
ASTRING1,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
ASTRING2,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
ASTRING3,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
BSTRING1,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
BSTRING2,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
BSTRING3,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
CSTRING1,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
CSTRING1,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
CSTRING3,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
.
.
I want to output the lines to a file only if there is a "B" in position 1 AND the string "B1TEST" after the first "," AND the string "dir3" after the second ",".
Is it possible with grep? Or is there a "better" command?
I couldn't find anything similar ...
Thanks
If your input is column-based, you can also consider awk (see the sketch after the output below). However, for this requirement, the following grep line should help:
grep '^B[^,]*,B1TEST[^,]*,.*/dir3/' file
It will give you output:
BSTRING1,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
BSTRING2,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
BSTRING3,B1TEST_NAME,/shared/dir1/dir2/dir3/dir4
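For reference, an awk version of the same filter (a sketch assuming the three comma-separated fields shown above) could look like this:
awk -F, '$1 ~ /^B/ && $2 ~ /^B1TEST/ && $3 ~ /\/dir3\//' file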

Bash CSV sorting and unique-ing

a Linux question: I have the CSV file data.csv with the following fields and values
KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
...
The field KEY is a float value, while LEVEL is an integer. I know that the first field can have repeated values, as well as the second one, but if you take them together you have a unique pair.
What I would like to do is sort the file by the column KEY and then, for each unique value of KEY, keep only the row having the highest value of LEVEL.
Sorting is not a problem:
$> sort -t, -k1,2 data.csv # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...
but then how can I filter the rows so that I get what I want, which is:
0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...
Is there a way to do it using command line tools like sort, uniq, awk and so on? Thanks in advance
try this line:
your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
output:
kent$ echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd
The idea is: because your file is already sorted, just go through the input from the top; whenever the first column (KEY) changes, print the previous line, which holds the highest LEVEL of the previous KEY.
Try it with your real data; it should work.
The whole logic (including your sort) could also be done by awk in a single process, as sketched below.
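A minimal sketch of that single-process idea, assuming GNU awk (gawk) for PROCINFO["sorted_in"] and the KEY,LEVEL,DATA header shown above:
gawk -F, '
    NR > 1 {                                     # skip the KEY,LEVEL,DATA header
        if (!($1 in best) || $2+0 > best[$1]) {  # first time we see this KEY, or a higher LEVEL
            best[$1] = $2 + 0                    # remember the highest LEVEL seen for this KEY
            row[$1]  = $0                        # and the full row that carried it
        }
    }
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"   # iterate keys in numeric order
        for (k in row) print row[k]
    }
' data.csv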
Use:
$> sort -r data.csv | uniq -w 5 | sort
given that your floats are formatted "0.000"-"9.999" (so the key is always the first 5 characters) and LEVEL is a single digit
Perl solution:
perl -aF, -ne '$h{$F[0]} = [@F[1,2]] if $F[1] > $h{$F[0]}[0]
}{
print join(",", $_, @{$h{$_}}), "\n" for sort {$a<=>$b} keys %h' data.csv
Note that the result is different from the one you requested, the first line contains bbb, not ccc.
