I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time. Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: within a single read of the Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
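To see the grouping logic in isolation, here is a tiny run on a made-up two-column sample (the file name sample.txt is just for illustration). Note that for (i in arr) iterates in an unspecified order, so the output is piped through sort here:

```shell
# Made-up sample: three lines, two distinct keys in column 2
printf '%s\n' 'id1 A' 'id2 B' 'id3 A' > sample.txt

# Group column 1 values by column 2; sort because the iteration
# order over an awk array is not guaranteed
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' sample.txt | sort
# A(id1,id3)
# B(id2)
```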
Explanation (1st solution): adding a detailed explanation for the 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE the first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing the 2nd field followed by arr[$2] wrapped in parentheses here.
delete arr[$2] ##Deleting arr value with 2nd field index here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, just run sort -k2,2 on your input first), this uses one pass, stores only one token at a time in memory, and produces the output in the same order of $2 values as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
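If the input is not already grouped by $2, the sort -k2,2 preprocessing mentioned above can be piped straight into the script. A sketch on a made-up unsorted sample (the file name unsorted.txt is illustrative):

```shell
# Made-up unsorted sample: the two A lines are separated by a B line
printf '%s\n' 'x1 A' 'y1 B' 'x2 A' > unsorted.txt

# Sort on the key field first so the script sees each group contiguously
sort -k2,2 unsorted.txt | awk '
BEGIN { ORS="" }
$2 != prev {
    printf "%s%s(", ORS, $2
    ORS = ")\n"
    sep = ""
    prev = $2
}
{
    printf "%s%s", sep, $1
    sep = ","
}
END { print "" }'
# A(x1,x2)
# B(y1)
```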
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.
Related
I have a csv file that looks like below
"10.8.70.67","wireless",,"UTY_07_ISD",,26579
"10.8.70.69","wireless",,"RGB_34_FTR",,19780
I want to retrieve first, second and fourth column values (without quotes) and populate into a another csv in the below format.
IP DEVICETYPE DEVICENAME
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
I have used the below awk command
awk -F ',|,,' '{gsub(/"/,"",$1); gsub(/"/,"",$2); gsub(/"/,"",$3); print $1, $2, $3}' file.csv
and got the below output
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
Please help in assigning headings to each column.
Assuming you don't have commas or double quotes inside the quoted strings (a big assumption!), it can be as simple as
$ awk -F, 'NR==1 {print "IP","DEVICETYPE","DEVICENAME"}
{gsub(/"/,"");
print $1,$2,$4}' file | column -t
IP DEVICETYPE DEVICENAME
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
With your shown samples, could you please try the following. Written and tested in GNU awk.
awk -v FPAT='([^,]*)|("[^"]+")' '
BEGIN{
print "IP DEVICETYPE DEVICENAME"
}
function remove(fields){
num=split(fields,arr,",")
for(i=1;i<=num;i++){
gsub(/^"|"$/,"",$arr[i])
}
}
{
remove("1,2,4")
print $1,$2,$4
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -v FPAT='([^,]*)|("[^"]+")' ' ##Setting FPAT to get only the matched fields, as ([^,]*)|("[^"]+") per the samples.
BEGIN{ ##Starting BEGIN section of this program from here.
print "IP DEVICETYPE DEVICENAME" ##printing header here.
}
function remove(fields){ ##Creating function named remove here, where we pass the field numbers from which we need to remove " characters.
num=split(fields,arr,",") ##Splitting fields into arr here.
for(i=1;i<=num;i++){ ##Traversing through all items of arr here.
gsub(/^"|"$/,"",$arr[i]) ##Globally substituting starting and ending " in mentioned fields with NULL here.
}
}
{
remove("1,2,4") ##Calling remove here with field numbers of 1,2 and 4 which we need as per output.
print $1,$2,$4 ##Printing 1st, 2nd and 4th field here.
}
' Input_file ##Mentioning Input_file name here.
A simple one-liner would be:
awk -F ',|,,' 'BEGIN {format = "%-20s %-20s %-20s\n"; printf format, "IP", "DEVICETYPE", "DEVICENAME"} {gsub(/"/,"",$1); gsub(/"/,"",$2); gsub(/"/,"",$3); printf format, $1, $2, $3}' abc.csv
Here I have used the BEGIN/END special pattern, which is used to do some startup or cleanup action, to add headings. For more details please refer to the documentation Using BEGIN/END
I got the expected output with the below command
awk -F ',|,,' 'BEGIN {print "IP,DEVICETYPE,DEVICENAME"} {gsub(/"/, "", $1); gsub(/"/, "", $2); gsub(/"/, "", $3); print $1","$2","$3}' input.csv > output.csv
I found that I was missing BEGIN part. Thanks all for your response.
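For reference, the command above can be checked on a single made-up row. Keep in mind (as noted in another answer) that the ',|,,' separator relies on empty fields collapsing, which breaks if a quoted value is empty or contains a comma:

```shell
# Single made-up CSV row in the same shape as the question's data
printf '%s\n' '"10.8.70.67","wireless",,"UTY_07_ISD",,26579' > input.csv

awk -F ',|,,' 'BEGIN {print "IP,DEVICETYPE,DEVICENAME"} {gsub(/"/, "", $1); gsub(/"/, "", $2); gsub(/"/, "", $3); print $1","$2","$3}' input.csv
# IP,DEVICETYPE,DEVICENAME
# 10.8.70.67,wireless,UTY_07_ISD
```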
I have a file of patterns (fileA.txt) that need to be searched for in a large file (fileB.txt) and replaced with the corresponding patterns in another file (fileC.txt).
Example:
fileB.txt
4472534
8BC4232
3533221
333553D
8645141
2412AAA
I want to search this patterns in fileB:
fileA.txt
BC423
33221
12AAA
Then I want to replace them with patterns in fileC, line by line:
fileC.txt
66FF7
11GYT
2HHJK
Expected output:
4472534
866FF72
3511GYT
333553D
8645141
242HHJK
I wrote something like this:
grep -f fileA.txt fileB.txt | xargs sed -i fileC.txt
however, it finds the patterns correctly but the substitution is not right.
Any advice?
fileA (pattern to search)
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
fileB
>AMP_4 RS0255 CENPF__ENST00000366955.7__6322__30__0.43333__69.25__1 RS0247
CAGTTGTGCAATTTGGTTTTCCAGCTCACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__10108__30__0.5__71.1396__1 RS0247
GAAGCCTGCAGCCCTCACTGGAAATAAACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__9236__30__0.5__69.816__1 RS0332
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
>AMP_4 RS0451 CENPF__ENST00000366955.7__8140__30__0.43333__68.033__1RS0255
GAGCTCCTTCAATTGATCTTTGCTGCTCTT
fileC (pattern to replace)
GGAGGATGGTGCCTGAATCTACTGGGCTCC
This should be a task for awk; could you please try the following, written and tested with your shown samples in GNU awk.
awk '
FNR==NR{
arr[$0]=FNR
next
}
FILENAME=="fileC.txt"{
arrVal[++count]=$0
next
}
FILENAME=="fileB.txt"{
for(key in arr){
if(sub(key,arrVal[arr[key]])){
break
}
}
print
}
' fileA.txt fileC.txt fileB.txt
Output will be as follows.
4472534
866FF72
3511GYT
333553D
8645141
242HHJK
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when fileA.txt is being read.
arr[$0]=FNR ##Creating arr with index of current line and value of current line number.
next ##next will skip all further statements from here.
}
FILENAME=="fileC.txt"{ ##Checking condition if file name is fileC.txt then do following.
arrVal[++count]=$0 ##Creating arrVal with index of count increasing value of 1 and having current line as its value.
next ##next will skip all further statements from here.
}
FILENAME=="fileB.txt"{ ##Checking condition if file name is fileB.txt then do the following.
for(key in arr){ ##Traversing through array arr here.
if(sub(key,arrVal[arr[key]])){ ##If key matches in the current line, substitute it with arrVal[arr[key]], i.e. replace the fileA pattern with its corresponding fileC value.
break ##Come out of loop to save some cycles.
}
}
print ##Printing current line here.
}
' fileA.txt fileC.txt fileB.txt ##Mentioning Input_file names here.
NOTE: We could also use ARGIND (GNU awk) condition checks in place of the file name checks above.
paste fileA fileC \
|awk 'NR==FNR{ mapping[$1] =$2; next }
{ for(pat in mapping){
gsub(pat, mapping[pat])
};
print
}' - fileB
You could use sed to generate a sed script that would replace them:
sed "$(paste fileA.txt fileC.txt | sed 's/\(.*\)\t\(.*\)/s#\1#\2#g/')" fileB.txt
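To see what the inner paste | sed pipeline produces, here is a sketch using the pattern files from the question (GNU sed assumed, for \t in the regex):

```shell
# Recreate the question's pattern files
printf '%s\n' BC423 33221 12AAA > fileA.txt
printf '%s\n' 66FF7 11GYT 2HHJK > fileC.txt

# The generated script: one s### command per search/replace pair
paste fileA.txt fileC.txt | sed 's/\(.*\)\t\(.*\)/s#\1#\2#g/'
# s#BC423#66FF7#g
# s#33221#11GYT#g
# s#12AAA#2HHJK#g
```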
Here is a one liner with paste + awk + sed:
sed -f <(awk '{printf "s/%s/%s/g\n",$1,$2}' <(paste file{A,C}.txt)) fileB.txt
4472534
866FF72
3511GYT
333553D
8645141
242HHJK
This might work for you (GNU sed & parallel):
parallel echo 's/{1}/{2}/' ::::+ file[AC] | sed -f - fileB
Build a sed script and then run the script with fileB as input.
N.B. ::::+ emulates the paste command, and {1} and {2} are the values of each line from fileA and fileC.
Hello Guys,
I have two files that have the same number of lines (503 exactly) and I want to append the contents of one file to the other, right in front of them, without any space or tab. Let us consider the input file contents are:
One.txt
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
Two.txt
one
val_ilu_girl
pacmanhall
four_stars
squares3
Now I want like this:
C:\Users\Desktop.VG\New-folder\?filename=one
C:\Users\Desktop.VG\New-folder\?filename=val_ilu_girl
C:\Users\Desktop.VG\New-folder\?filename=pacmanhall
C:\Users\Desktop.VG\New-folder\?filename=four_stars
C:\Users\Desktop.VG\New-folder\?filename=squares3
Is there a way to do this using anything from sed, grep, awk to Excel, Notepad++, etc.?
Thanks in advance...!
Could you please try the following.
awk 'FNR==NR{a[FNR]=$0;next} {print $0 a[FNR]}' two.txt one.txt
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
a[FNR]=$0 ##Creating an array named a whose index is FNR and value is current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
{ ##These statements will be executed when 2nd Input_file is being read.
print $0 a[FNR] ##Printing current line along with array a value with index of FNR.
}
' two.txt one.txt ##Mentioning Input_file names here.
That's the job paste was invented to do:
$ paste -d '' one.txt two.txt
C:\Users\Desktop.VG\New-folder\?filename=one
C:\Users\Desktop.VG\New-folder\?filename=val_ilu_girl
C:\Users\Desktop.VG\New-folder\?filename=pacmanhall
C:\Users\Desktop.VG\New-folder\?filename=four_stars
C:\Users\Desktop.VG\New-folder\?filename=squares3
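paste's default delimiter is a tab; -d '' joins the lines with no delimiter at all (on some non-GNU implementations this is spelled -d '\0'). A minimal sketch with made-up file contents:

```shell
# Made-up two-line files standing in for one.txt and two.txt
printf '%s\n' 'prefix=' 'prefix=' > one.txt
printf '%s\n' 'foo' 'bar' > two.txt

# Join corresponding lines with no separator between them
paste -d '' one.txt two.txt
# prefix=foo
# prefix=bar
```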
I want to replace the ">" with variable names starting with ">" and ending with ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per ">"-line, won't fail if the file name contains other dots, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta.
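For instance, the multi-dot claim can be checked with a made-up file name containing extra dots; only the final extension is stripped:

```shell
# Made-up FASTA file whose name contains more than one dot
printf '%s\n' '>textofDNA' 'ATCCCCGGG' > my.sample.fasta

awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' my.sample.fasta
# >my.sample.textofDNA
# ATCCCCGGG
```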
Or like this? You don't really need the loop, basename, or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk has knowledge of the file name it currently operates on through the built-in variable FILENAME; I strip the .fasta extension using gensub and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" followed by the content of my variable stub. After that I print it.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all remaining lines (lines starting with > were already printed and skipped via next).
' sample01.fasta ##Mentioning Input_file name here.
This is the input to my file.
Number : 123
PID : IIT/123/Dakota
The expected output is :
Number : 111
PID : IIT/111/Dakota
I want to replace 123 with 111. To solve this I have tried the following:
awk '/Number/{$NF=111} 1' log.txt
awk -F '[/]' '/PID/{$2="123"} 1' log.txt
Use sed for something this simple?
Print the change to the screen (test with this) :
sed -e 's:123:111:g' f2.txt
Update the file (with this) :
sed -i 's:123:111:g' f2.txt
Example:
$ sed -i 's:123:111:g' f2.txt
$ cat f2.txt
Number : 111
PID : IIT/111/Dakota
EDIT2: Or, if you want to substitute each line's 123 with 111 without checking any condition (which you tried in your awk), then simply do:
awk '{sub(/123/,"111")} 1' Input_file
Change sub to gsub in case of multiple occurrences of 123 on a single line.
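A quick illustration of the sub vs gsub difference, on a made-up line containing 123 twice:

```shell
# sub() replaces only the first match on the line
echo 'a123b123' | awk '{sub(/123/,"111")} 1'
# a111b123

# gsub() replaces every match on the line
echo 'a123b123' | awk '{gsub(/123/,"111")} 1'
# a111b111
```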
Explanation (of the code shown in the EDIT below):
awk -v new_value="111" ' ##Creating an awk variable named new_value holding the new value the OP wants in the line.
/^Number/ { $NF=new_value } ##Checking if a line starts from Number string and then setting last field value to new_value variable here.
/^PID/ { num=split($NF,array,"/"); ##Checking if a line starts with PID, then creating an array named array from the last field value, with / as delimiter.
array[2]=new_value; ##Setting second item of array to variable new_value here.
for(i=1;i<=num;i++){ val=val?val "/" array[i]:array[i] }; ##Looping from 1 to the length of array, building variable val to re-create the last field of the current line.
$NF=val; ##Setting last field value to variable val here.
val="" ##Nullifying variable val here.
}
1' Input_file ##Mentioning 1 to print the line and mentioning Input_file name here too.
EDIT: In case you need to / in your output too then use following awk.
awk -v new_value="111" '
/^Number/ { $NF=new_value }
/^PID/ { num=split($NF,array,"/");
array[2]=new_value;
for(i=1;i<=num;i++){ val=val?val "/" array[i]:array[i] };
$NF=val;
val=""
}
1' Input_file
The following awk may help you here. (It seems that after I applied code tags to your samples, your sample input changed a bit, so I am editing my code accordingly now.)
awk -F"[ /]" -v new_value="111" '/^Number/{$NF=new_value} /^PID/{$(NF-1)=new_value}1' Input_file
In case you want to save changes into Input_file itself, append > temp_file && mv temp_file Input_file to the above code.
Explanation:
awk -F"[ /]" -v new_value="111" ' ##Setting field separator to space and / for each line, and creating awk variable new_value which holds the new value the OP wants.
/^Number/{ $NF=new_value } ##Checking condition if a line is starting with string Number then change its last field to new_value value.
/^PID/ { $(NF-1)=new_value } ##Checking condition if a line starts from string PID then setting second last field to variable new_value.
1 ##awk works on the method of condition then action; putting 1 makes the condition TRUE, and with no action mentioned the default action, printing the current line, happens.
' Input_file ##Mentioning Input_file name here.
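A quick run on the question's sample (recreated here as log.txt) shows the caveat the EDIT above addresses: assigning to a field with -F"[ /]" rebuilds the line with the default OFS (a space), so the slashes in the PID line are lost:

```shell
# Recreate the question's sample input
printf '%s\n' 'Number : 123' 'PID : IIT/123/Dakota' > log.txt

awk -F"[ /]" -v new_value="111" '/^Number/{$NF=new_value} /^PID/{$(NF-1)=new_value}1' log.txt
# Number : 111
# PID : IIT 111 Dakota   <- slashes became spaces; use the EDIT version to keep them
```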