Splitting lines based on a delimiter in UNIX - string

I have some data returned by an SQL query, shown below. I am trying to split the line on a delimiter and put each record on its own line. How can I do this in UNIX? I tried shell scripting but couldn't get it to work.
ALB|1001|2012-04-15 ALB|1001|2012-04-14 ALB|1001|2012-04-16 ALB|1001|2012-04-17
Desired output:
ALB|1001|2012-04-15
ALB|1001|2012-04-14
ALB|1001|2012-04-16
ALB|1001|2012-04-17

For that particular example, tr ' ' '\n' < file ought to work:
echo "ALB|1001|2012-04-15 ALB|1001|2012-04-14 ALB|1001|2012-04-16 ALB|1001|2012-04-17" | tr ' ' '\n'

xargs is a single program that can do this, as in the following (note that -d is a GNU xargs option):
$ echo "ALB|1001|2012-04-15 ALB|1001|2012-04-14 ALB|1001|2012-04-16 ALB|1001|2012-04-17"|xargs -d' ' -n1
ALB|1001|2012-04-15
ALB|1001|2012-04-14
ALB|1001|2012-04-16
ALB|1001|2012-04-17
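If the input may contain runs of several spaces, the tr approach above can be tightened slightly. The following is a minimal sketch, not part of the original answers; split_records is a hypothetical helper name.

```shell
# split_records (hypothetical name): read stdin, write one record per line.
split_records() {
    # Turn every space into a newline; -s then squeezes repeated newlines,
    # so consecutive spaces (or existing blank lines) do not produce empty lines.
    tr -s ' ' '\n'
}

printf '%s\n' "ALB|1001|2012-04-15 ALB|1001|2012-04-14 ALB|1001|2012-04-16 ALB|1001|2012-04-17" | split_records
```

Because it only uses tr, this should behave the same on systems where xargs has no -d option.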

Related

parsing data from log using awk

I want to extract machineId, userId, origReqUri, filename, mime, size and checksum as comma-separated values from this log pattern. Is there an awk command to do it?
test1.1/test.log.2020-07-14-20:2020-07-14 20:47:44,239 [http--1594759553405 sessionId:4567 nodeId:node-1 machineId:31656 userId:2540397 origReqUri:/test1/batch] INFO com.test.company - [RETURN INFO - RETURN] - TRACK_PREPROCESSED_DATA_POPULATION: Populated test_doc_version entry for doc version [1130783_1_0] with data from test_doc_metadata. File name: [09014b3080135f44.doc]. Mime type: [application/msword]. Content size: [100352]. MD5 checksum: [7ef30e834107990c95c7e53f7b6f6ee6]. [source:]
I tried
grep machineId:31656 test.1/test.log.2020-07-14-* |grep "Populated test_doc_version entry" | awk machineId |awk origReqUri
I didn't use awk; I would solve this mostly with sed and grep, like this:
sed s/': '/':'/g input | sed s/' '/\\n/g | grep 'machineId\|userId\|origReqUri\|name\|type\|size\|checksum' | sed 's/\[\|\]\|\.//g' | tr '\n' ',' | sed 's/name/filename/g' | sed 's/type/mime/g' | sed 's/.$//'
P.S.: "input" is the name of the file where I saved the log line.
The result for the provided input is:
machineId:31656,userId:2540397,origReqUri:/test1/batch,filename:09014b3080135f44doc,mime:application/msword,size:100352,checksum:7ef30e834107990c95c7e53f7b6f6ee6
It is probably not the best solution and we can certainly make it smaller and more beautiful, but I hope it helps you.
There's another solution that is simpler and much more readable:
tr -s ' :[]' ' ' < input | cut -d ' ' -f 12,14,16,39,43,47,51
Here the output is space-separated rather than comma-separated. I think it is better not to use commas, since a comma is one of the special symbols that can appear in the log itself.
The result for this one is:
31656 2540397 /test1/batch 09014b3080135f44.doc application/msword 100352 7ef30e834107990c95c7e53f7b6f6ee6
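Since the question asked for awk, here is a minimal sketch that pulls each value out by its label instead of by field position. It assumes the log lines follow the sample exactly (values after "key:" end at a space or "]", and the file details appear as "Label: [value]"). extract_fields and grab are hypothetical names, not from the original answers.

```shell
# extract_fields (hypothetical name): read log lines from files or stdin,
# print the seven values as one comma-separated line per matching log line.
extract_fields() {
    awk -v OFS=',' '
    # grab(s, key, bracket): return the value that follows key in s.
    # bracket=1 reads "key[value]"; bracket=0 reads "keyvalue" up to a space or "]".
    function grab(s, key, bracket,    re, skip) {
        if (bracket) { re = key "\\[[^]]*"; skip = length(key) + 1 }
        else         { re = key "[^] ]*";   skip = length(key) }
        return match(s, re) ? substr(s, RSTART + skip, RLENGTH - skip) : ""
    }
    /Populated test_doc_version entry/ {
        print grab($0, "machineId:", 0), grab($0, "userId:", 0),
              grab($0, "origReqUri:", 0), grab($0, "File name: ", 1),
              grab($0, "Mime type: ", 1), grab($0, "Content size: ", 1),
              grab($0, "MD5 checksum: ", 1)
    }' "$@"
}
```

It could be combined with the grep from the question, for example: grep machineId:31656 test.1/test.log.2020-07-14-* | extract_fields. Unlike the sed pipeline, this keeps the dot in the file name.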

extract sequences from multifasta file by ID in file using awk

I would like to extract the sequences from a multifasta file whose IDs appear in a separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
This code works well for multiline sequences, but each ID has to be passed to the command separately.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
This code takes the IDs from the id.txt file but returns only the first line of multiline sequences.
I guess the right approach would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
The following awk may also help:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
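The question asked about changing RS in the second command. One way that might look is sketched below: RS is switched to ">" only for the FASTA file, so each record is a whole entry and its first field is the ID. This assumes ">" never appears inside a header or sequence and that id.txt is not empty; extract_fasta is a hypothetical helper name.

```shell
# extract_fasta IDS FASTA (hypothetical name):
# print the FASTA records whose ID is listed in the IDS file.
extract_fasta() {
    awk '
    NR == FNR { if (NF) ids[$1]; next }      # first file: remember each ID
    NF && ($1 in ids) { printf ">%s", $0 }   # FASTA records: the ID is the first field
    ' "$1" RS='>' "$2"                       # RS changes only before the second file is read
}
```

Usage would be: extract_fasta id.txt seq.fasta. The record text already ends with a newline, which is why printf adds none.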
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
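seqnm=$1
# Look up the title in the index; grep -n prepends the index line number
idx0_idx1=$(grep -n "$seqnm" multi-fasta.idx)
idx0=$(echo "$idx0_idx1" | cut -d ":" -f 1)
idx0plus1=$((idx0 + 1))
idx1=$(echo "$idx0_idx1" | cut -d ":" -f 2)
# Line number (in multi-fasta.fa) of the next sequence title
idx2=$(head -n "$idx0plus1" multi-fasta.idx | tail -1 | cut -d ":" -f 1)
idx2minus1=$((idx2 - 1))
# Print only lines idx1 through idx2minus1
sed -n "${idx1},${idx2minus1}p" multi-fasta.fa > "${seqnm}.fasta"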
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
Finally, sed extracts the lines from idx1 to idx2 minus 1, which are the title and the sequence. (If every sequence spanned a fixed number of lines, grep -A would do instead.)
The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
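As an alternative to the index file, a single awk pass can print one record and stop reading as soon as the next title appears. This is a different technique from the sed approach above, sketched under the assumption that the title line is exactly ">" followed by the ID; extract_one is a hypothetical helper name, and no timing claims are made for it.

```shell
# extract_one ID FASTA (hypothetical name):
# print the single record whose title line is exactly ">ID".
extract_one() {
    awk -v id="$1" '
    /^>/ { if (found) exit; found = ($0 == ">" id) }   # quit at the title after the hit
    found                                              # print lines while inside the hit
    ' "$2"
}
```

For example: extract_one DM_0000016115 multi-fasta.fa > DM_0000016115.fasta. It also handles the last sequence in the file, which has no following title.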

how to count occurrence of specific word in group of file by bash/shellscript

I have two text files, simple.txt and simple1.txt, with the following data in them:
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" in each file, but I want to display the total count for both files, i.e. 7. Can I do this in a single command or shell script?
P.S.: I had to write two commands as tr does not take two file names.
Try this, the straightforward way:
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i.
It is not that tr accepts only one filename: it does not accept any filename at all and always reads from stdin. That's why, even in your solution, you didn't pass a filename to tr but used input redirection.
In your case, I think you can replace tr with fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it better readable this way, but this is a matter of taste).
Note that both solutions (mine and your original ones) would count a string consisting of letters and one or more non-space characters - for instance the string haaaa.hbbbbbb.hccccc - as a "single block", i.e. it would only add 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour, it's up to you to decide.
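If each "h"-word inside such a block should be counted separately, one option is to let grep print every match on its own line and count the lines. This is a minimal sketch, assuming a grep that supports -o and -h (GNU and BSD grep do; they are not in POSIX); count_h_words is a hypothetical helper name.

```shell
# count_h_words FILE... (hypothetical name):
# count words starting with "h" or "H" across all given files.
count_h_words() {
    # -o prints each match on its own line, -h drops file name prefixes,
    # -i ignores case, -w requires the match to be a whole word.
    grep -o -h -i -w 'h[[:alnum:]_]*' "$@" | wc -l | tr -d ' '
}
```

Called as count_h_words simple.txt simple1.txt, this treats punctuation as a word boundary, so haaaa.hbbbbbb.hccccc contributes three matches rather than one.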

Bash + remove spaces from line

I wrote the following bash code to build a comma-separated list of disk partition paths, where each partition gets the next device letter:
number_of_disks=5
mount_p=({a..z})
path=` for i in \`seq 1 $number_of_disks \`; do mount_p="$(echo $mount_p| tr '[a-z]' '[b-z]a')"; echo /home/sd$mount_p/oop/app/data","; done `
But when I print $path, there is a space between each partition:
echo $path
/home/sdb/oop/app/data, /home/sdc/oop/app/data,
/home/sdd/oop/app/data, /home/sde/oop/app/data,
/home/sdf/oop/app/data,
The second problem is the unnecessary "," at the end of the line.
Based on my code, how can I create the path variable without spaces and without the "," at the end of the CSV line?
You are using a very complex way (hacky as hell) to achieve something rather simple:
path=$(echo /home/sd{b..f}/oop/app/data | tr ' ' ,)
You can change your path variable like this:
echo $path | sed 's/ //g;s/,$//g'
It will remove last ',' and spaces.
UPD.
Or:
path=( $(echo /home/sd{b..z}/oop/app/data | tr ' ' ',') ); echo "${path[@]}"
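Brace expansion cannot take its end point from a variable such as number_of_disks, so a loop is one way to keep the count configurable. The sketch below assumes the devices start at sdb, as in the question's output, and that there are at most 25 of them; build_paths is a hypothetical helper name.

```shell
# build_paths N (hypothetical name): print a comma-joined list of
# /home/sdX/oop/app/data for the first N device letters, starting at sdb.
build_paths() {
    n=$1
    out=
    for letter in b c d e f g h i j k l m n o p q r s t u v w x y z; do
        [ "$n" -gt 0 ] || break
        # ${out:+$out,} adds the comma only when out is already non-empty,
        # so there is no leading or trailing comma to strip afterwards.
        out="${out:+$out,}/home/sd$letter/oop/app/data"
        n=$((n - 1))
    done
    printf '%s\n' "$out"
}

number_of_disks=5
path=$(build_paths "$number_of_disks")
echo "$path"
```

Quoting "$path" in the echo matters as well: an unquoted $path is split and rejoined with spaces by the shell.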

Squeezing spaces between columns in Unix shell

I want the spaces to be removed between two columns.
After running a sql query from shell, I'm getting the output as below:
23554402243 0584940772;2TZ0584940772001U;
23554402272 0423721840;7TT0423721840001B;
23554402303 0110770863;BBTU500248822001Q;
23554402305 02311301;BTB02311301001J;
23554402563 0550503408;PPTU004984208001O;
23554402605 0457553223;Q0T0457553223001I;
23554367602 0454542427;TB8U501674990001V;
23554378584 0383071261;HTHU500374797001Y;
23554404965 059792244;ST3059792244005C;
23554405503 0571632586;QTO0571632586001D;
But the desired output should be like below:
23554400043 0117601738;22TU003719388001V;
23554402883 0823973229;TTT0823973229001C;
23554402950 024071080;MNT024071080001D;
23554405827 0415260614;TL20415260614001R;
23554405828 08119270800;TL2U003010407001G;
23554406553 011306895;VBT011306895001E;
23554406557 054121509;TL2054121509001M;
23554406563 065069209;TL2065069209005M;
23554409085 0803434328;QTO0803434328001B;
23553396219 062004063;G6T062004063001C;
Remember, there should be exactly one tab between the two columns in the desired output.
To put a single tab between the first two columns (add the g flag to apply the change between all columns):
sed -r 's/\s+/\t/' inputfile
If -r is not available:
sed 's/\s\+/\t/'
Or, if you need a single space in place of every run of spaces:
tr -s ' '
Easy to do using this awk:
awk -v OFS='\t' '{$1=$1} 1' file
23554402243 0584940772;2TZ0584940772001U;
23554402272 0423721840;7TT0423721840001B;
23554402303 0110770863;BBTU500248822001Q;
23554402305 02311301;BTB02311301001J;
23554402563 0550503408;PPTU004984208001O;
23554402605 0457553223;Q0T0457553223001I;
23554367602 0454542427;TB8U501674990001V;
23554378584 0383071261;HTHU500374797001Y;
23554404965 059792244;ST3059792244005C;
23554405503 0571632586;QTO0571632586001D;
Alternatively this tr will also work:
tr -s ' ' < file | tr ' ' '\t'
or this sed:
sed -i.bak $'s/ \{1,\}/\t/g' file
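The awk one-liner above works because assigning to a field ($1=$1) makes awk rebuild the line from its fields, joined by OFS. A minimal sketch of that idea wrapped as a function follows; squeeze_columns is a hypothetical helper name.

```shell
# squeeze_columns [FILE...] (hypothetical name):
# rejoin whitespace-separated fields with exactly one tab.
squeeze_columns() {
    # Default field splitting treats any run of spaces or tabs as one separator
    # and ignores leading and trailing whitespace; $1 = $1 forces the rebuild.
    awk -v OFS='\t' '{ $1 = $1 } 1' "$@"
}
```

Piping the result through od -c is one way to confirm that the separator really is a single tab, since tabs and spaces look alike in a terminal.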
What about the following perl one-liner?
perl -ne '/(.*?)\s+(.*)/; print "$1\t$2\n"' your_input_file
