I have this awk command:
awk -F'\t' '{for(i=1;i<=NF;i++) {if($i~/ensembl_gene_id*/) {h=$i}} ;for(a=1;a<=NF;a++) {if($a~/ensembl_gn*/) {z=$a}} print $1,$2,$3,z,h}'
It searches for several strings in multiple, unpredictable fields separated by "\t" and prints them. But my awk skills are not great and I would like to rewrite it with only one loop (right now I have two loops, one for "i" and one for "a"). Could you help me find a simpler way in awk? (The code works.)
I think it should be something like this:
awk -F'\t' '{for(i=1;i<=NF;i++) {if($i~/ensembl_gene_id* | esnembl_gn*/) {h=$i}} {print $1,$2,$3,h}'
But it prints only the first match.
INPUT:
1 2 les ensembl_gene_id=aaa aha ensembl_gn=BRAF
2 3 pes ccds ensembl_gene_id=kkk ahl klkl ensembl_gn=OTC
2 2 ves ccds=1 ccds=2 ensembl_gene_id=cac ensembl_gn=BRCA
OUTPUT:
1 2 les ensembl_gene_id=aaa ensembl_gn=BRAF
2 3 pes ensembl_gene_id=kkk ensembl_gn=OTC
2 2 ves ensembl_gene_id=cac
Thank you
EDIT: After seeing OP's samples, adding the following solution. (Change awk to awk 'BEGIN{FS=OFS="\t"} in case your Input_file is TAB-delimited and your output should be TAB-delimited too.)
awk '
match($0,/ensembl_gene_id[^ ]*/){
  val=substr($0,RSTART,RLENGTH)
}
match($0,/ensembl_gn[^ ]*/){
  val1=substr($0,RSTART,RLENGTH)
}
{
  print $1,$2,$3,val,val1
  val=val1=""
}
' Input_file
As far as I understood from your question (you want to run a single for loop and check 2 conditions; in that case we need not use 2 loops, we can use a single loop with 2 conditions in it), could you please try the following.
awk -F'\t' '{h=z="";for(i=1;i<=NF;i++){if($i~/^ensembl_gene_id=/){h=$i};if($i~/^ensembl_gn=/){z=$i}};print $1,$2,$3,z,h}' Input_file
OR (a non-one-liner form of the solution):
awk '
{
  h=z=""
  for(i=1;i<=NF;i++){
    if($i~/^ensembl_gene_id=/){
      h=$i
    }
    if($i~/^ensembl_gn=/){
      z=$i
    }
  }
  print $1,$2,$3,z,h
}
' Input_file
Issue with OP's attempt: it will always print 1 value only, because when the other string is found it overwrites the variable's previous value.
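The overwrite can be demonstrated on the first sample row (tab-separated here). This is a sketch, not OP's exact code: it uses a single anchored, combined pattern so that both strings share one variable, which is exactly what loses a match:

```shell
# One variable for both patterns: each later match overwrites the
# earlier one, so only the last matching field survives.
printf '1\t2\tles\tensembl_gene_id=aaa\taha\tensembl_gn=BRAF\n' |
awk -F'\t' '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^ensembl_(gene_id|gn)=/)  # combined, anchored pattern
      h = $i                            # overwrites any previous match
  print $1, $2, $3, h                   # -> 1 2 les ensembl_gn=BRAF
}'
```

Only ensembl_gn=BRAF survives, because it appears later in the record than ensembl_gene_id=aaa.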
Are you just trying to print the ensembl_gene_id and ensembl_gn fields? That'd be:
$ awk '{
    delete f
    for (i=1;i<=NF;i++) {
        split($i,t,/=/)
        f[t[1]] = $i
    }
    print $1, $2, $3, f["ensembl_gene_id"], f["ensembl_gn"]
}' file
1 2 les ensembl_gene_id=aaa ensembl_gn=BRAF
2 3 pes ensembl_gene_id=kkk ensembl_gn=OTC
2 2 ves ensembl_gene_id=cac ensembl_gn=BRCA
Related
I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time. Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
  arr[$2]=(arr[$2]?arr[$2]",":"")$1
  next
}
($2 in arr){
  print $2"("arr[$2]")"
  delete arr[$2]
}
' Input_file Input_file
2nd solution: within a single read of Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation (1st solution): adding a detailed explanation for the 1st solution here.
awk '                                ##Starting awk program from here.
FNR==NR{                             ##Condition FNR==NR is TRUE while Input_file is being read for the first time.
  arr[$2]=(arr[$2]?arr[$2]",":"")$1  ##Creating an array indexed by the 2nd field, appending each $1 to its value, comma-separated.
  next                               ##next will skip all further statements from here.
}
($2 in arr){                         ##Checking if the 2nd field is present in arr; if so, do the following.
  print $2"("arr[$2]")"              ##Printing the 2nd field followed by its collected values ( arr[$2] ) in parentheses.
  delete arr[$2]                     ##Deleting the arr value indexed by the 2nd field here.
}
' Input_file Input_file              ##Mentioning Input_file names here (the file is read twice).
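The 2nd (single-pass) solution can be sanity-checked with a couple of made-up rows (a1/b2/a3 and keys CC/EN are invented tokens). Note that "for (i in arr)" iterates in an unspecified order in awk, so the output is piped through sort for a stable comparison:

```shell
# Single-pass grouping demo; sort is appended because the order of
# "for (i in arr)" is unspecified.
printf '%s\n' 'a1 CC' 'b2 EN' 'a3 CC' |
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' |
sort
# -> CC(a1,a3)
#    EN(b2)
```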
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, just run sort -k2,2 on your input first), this makes 1 pass, only stores one token at a time in memory, and produces the output in the same order of $2s as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
    printf "%s%s(", ORS, $2
    ORS = ")\n"
    sep = ""
    prev = $2
}
{
    printf "%s%s", sep, $1
    sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.
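The H/x mechanics are easier to see on a toy example. This minimal sketch (GNU sed assumed, as above) collects all lines into the hold space and strips the leading-newline artefact the same way:

```shell
# H appends each line to the hold space (prefixed with a newline);
# on the last line, x swaps the hold space in, newlines become commas,
# and s/.// deletes the leading newline artefact.
seq 3 | sed -n 'H; ${x; s/\n/,/g; s/.//; p}'
# -> 1,2,3
```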
I want to split a .txt into two, with one file having all lines where the first column's first character is "A" and the total number of characters in the first column is 6, while the other file has all the rest. Searching led me to the awk command and ways to separate files based on the first character, but I couldn't find any way to separate based on column length.
I'm not familiar with awk, so what I tried (to no avail) was awk -F '|' '$1 == "A*****" {print > ("BeginsWithA.txt"); next} {print > ("Rest.txt")}' FileToSplit.txt.
Any help or pointers to the right direction would be very appreciated.
EDIT: As RavinderSingh13 reminded, it would be best for me to put some samples/examples of input and expected output.
So, here's an input example:
#FileToSplit.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
A35646|Line 3|Stuff 3
641|Line 4|Stuff 4
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
413|Line 7|Stuff 7
What the expected output is:
#BeginsWith6.txt#
A35646|Line 3|Stuff 3
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
#Rest.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
641|Line 4|Stuff 4
413|Line 7|Stuff 7
What you want to do is use a regex and length function. You don't show your input, so I will leave it to you to set the field separator. Given your description, you could do:
awk '/^A/ && length($1) == 6 { print > "file_a.txt"; next } { print > "file_b.txt" }' file
Which will take the information in file and, if the first field begins with "A" and is 6 characters in length, write the record to file_a.txt; otherwise the record is written to file_b.txt (adjust names as needed).
A non-regex awk solution:
awk -F'|' '{print $0>(index($1,"A")==1 && length($1)==6 ? "file_a.txt" : "file_b.txt")}' file
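To see which record the ternary routes where, the same test can print the target name instead of redirecting to files (the names file_a/file_b and the abbreviated rows are for illustration only):

```shell
# Print the routing decision instead of writing files, for inspection.
printf '%s\n' 'A35646|x' '2134|y' 'A480|z' |
awk -F'|' '{print (index($1,"A")==1 && length($1)==6 ? "file_a" : "file_b"), $0}'
# -> file_a A35646|x
#    file_b 2134|y
#    file_b A480|z
```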
With your shown samples, could you please try the following. Since some rows do not start with A, the first solution does not require that; it only makes sure the 1st field is all digits and 6 characters long.
awk -F'|' '$1~/^[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
2nd solution: in case your 1st field starts with A followed by 5 digits (as you state), then try the following.
awk -F'|' '$1~/^A[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
OR (a better version of the above):
awk -F'|' '$1~/^A[0-9]{5}$/{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
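An end-to-end check of the length-based variant in a scratch directory, with sample rows abbreviated from the question:

```shell
# Run in a temp dir so the output files do not clutter anything.
cd "$(mktemp -d)"
printf '%s\n' 'A35646|Line 3' '2134|Line 1' 'A48029|Line 5' > FileToSplit.txt
awk -F'|' '$1~/^A[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' FileToSplit.txt
cat BeginsWith6.txt   # the two A±5-digit rows
cat rest.txt          # 2134|Line 1
```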
I have a big text file saved as test.txt. Now I want to split the big text file into blocks at the ... separator lines and save each block under the name that appears after /home/niu/. (In the data example below, the blocks of data should be saved as 20190630_073410_1.5_29_PCK.txt for the first block, 20180630_073410_1.5_29_PCK.txt for the second block and 20190830_093410_1.5_29_PCK.txt for the third block.)
So I tried the code below:
#!/bin/sh
for file in 'test.txt'
do
split -l '...'
done
It does not work; I hope somebody will help me. Thanks.
My data saved in test.txt is given below:
...........................................................................................................
/home/niu/20190630_073410_1.5_29_PCK.txt 470.2359935984357 41573823894247.63 53.46648291467124 216 1 0.1
/home/niu/20190630_073410_1.5_29_PCK.txt 13.124782961287574 219608788311302.7 53.46425102814092 219 1 0.6
/home/niu/20190630_073410_1.5_29_PCK.txt 4.092419925137149 12174862157739.746 53.44206693334351 291 1 1.1
...........................................................................................................
/home/niu/20180630_073410_1.5_29_PCK.txt 2.241494955966288 363350265475740.4 53.36874778729164 219 1 0.1
/home/niu/20180630_073410_1.5_29_PCK.txt 1.6671382966847936 282579486756.3921 53.234249504389624 218 1 2.1
/home/niu/20180630_073410_1.5_29_PCK.txt 1.4410832347641427 17729080367.579777 53.06935945567802 216 1 2.6
...........................................................................................................
/home/niu/20190830_093410_1.5_29_PCK.txt 1.2367527642969733 5141.577700615736 52.776493933960644 127 0 3.6
/home/niu/20190830_093410_1.5_29_PCK.txt 1.171644866817557 3279.978138771641 52.65760209064783 135 0 4.1
/home/niu/20190830_093410_1.5_29_PCK.txt 1.120249969361367 2441.45977994814 52.54882982584634 105 0 4.6
awk '/\.\.\./{close(out); next} {split($1, a, "/"); out=a[4]; print > out}' file
You can use this awk. I have assumed that the dots (...) exist only in the separating lines, and that all other lines start with /home/niu/filename.txt, from which we get the output filename. If this is not the case, please update the question.
You can use csplit like this:
csplit test.txt '/^\./' '{*}'
Could you please try the following, written and tested with your shown samples in GNU awk.
awk -F'[ /]' '
!NF || /^\.+/{
  next
}
out_file!=$4{
  close(out_file)
  out_file=$4
}
{
  print >> (out_file)
}' Input_file
Explanation: adding a detailed explanation for the above.
awk -F'[ /]' '        ##Starting awk program from here and setting space and / as field separators for all lines.
!NF || /^\.+/{        ##If the number of fields is zero OR the line starts with dot(s), then do the following.
  next                ##next will skip all further statements from here.
}
out_file!=$4{         ##If out_file is NOT equal to the 4th field, then do the following.
  close(out_file)     ##Closing the file in the back end, to avoid a "too many open files" error.
  out_file=$4         ##Setting out_file to the 4th field here.
}
{
  print >> (out_file) ##Printing the current line to the out_file output file.
}' Input_file         ##Mentioning Input_file name here.
EDIT: As per OP there could be lines starting with spaces, so in that case try the following.
awk -F'/' '
!NF || /^\./{
  next
}
{
  split($4,arr," ")
}
out_file!=arr[1]{
  close(out_file)
  out_file=arr[1]
}
{
  print >> (out_file)
}' Input_file
I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids that have the same letter code and starting alphabet.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare it with the letter in the second column.
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try the following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: adding an explanation for the above code.
awk '                ##Starting awk program here.
BEGIN{               ##Starting BEGIN section for awk here.
  FS=","             ##Setting FS as comma here, the field separator.
}                    ##Closing the BEGIN block here.
substr($1,1,1) == $2 ##Using the substr function, substr(string, starting position, length), to take the 1st letter of $1 and compare it with $2 of the current line; if TRUE then the current line is printed.
' Input_file         ##Mentioning Input_file name here.
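A minimal sketch of that substr comparison on two hand-typed rows (Glycine's one-letter code really is G, so it matches; Arginine's is R, so it doesn't):

```shell
# substr(s, start, length): substr($1,1,1) is the first character of
# field 1; the bare comparison acts as the pattern, printing matches.
printf '%s\n' 'Glycine,G' 'Arginine,R' |
awk -F, 'substr($1,1,1) == $2'
# -> Glycine,G
```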
Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
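The \1 backreference is what enforces the match: (.) captures the first character of the line, and \1 requires that same character immediately after the first comma. A minimal sketch on two abbreviated rows:

```shell
# [^,]* cannot cross a comma, so \1 is compared against the character
# right after the first comma (the one-letter code column).
printf '%s\n' 'Serine,S,x' 'Arginine,R,x' |
grep -E '^(.)[^,]*,\1'
# -> Serine,S,x
```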
Same as RavinderSingh's expression, but the field separator is specified differently.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I have a csv file that is generated automatically every day, with output similar to the following example:
"N","3.5",3,"Bob","10/29/17"
"Y","4.5",5,"Bob","10/11/18"
"Y","5",6,"Bob","10/28/18"
"Y","3",1,"Jim",
"N","4",2,"Jim","09/29/17"
"N","2.5",4,"Joe","01/26/18"
I need to transform the text so that it is grouped by person (the fourth column), with all of a person's records on a single row and the columns repeated in the same sequence: 1,2,3,5. Some cells may be missing data but must remain in the sequence so the columns line up. So the output I need will look like this:
"Bob","N","3.5",3,"10/29/17","Y","4.5",5,"10/11/18","Y","5",6,"10/28/18"
"Jim","Y","3",1,,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
I am open to using sed, awk, or pretty much any standard Linux command to get this task done. I've been trying to use awk, and though I get close, I can't figure out how to finish it.
Here is the command where I'm close. It lists the header and the names, but no other data:
awk -F"," 'NR==1; NR>1 {a[$4]=a[$4] ? i : ""} END {for (i in a) {print i}}' test2.csv
You need a little more code:
$ awk 'BEGIN {FS=OFS=","}
{k=$4; $4=$5; NF--; a[k]=(k in a?a[k] FS $0:$0)}
END {for(k in a) print k,a[k]}' file
"Bob","N","3.5",3,"10/29/17" ,"Y","4.5",5,"10/11/18" ,"Y","5",6,"10/28/18"
"Jim","Y","3",1, ,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
Note that the NF-- trick may not work in all awks.
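Where NF-- is not supported, one portable alternative (a sketch, not the answer's exact code) is to let the $4=$5 assignment rebuild the record and then strip the now-duplicated last field with sub():

```shell
# After $4=$5 the record ends with the date twice; sub() removes the
# final ",field" so the record matches what NF-- would have produced.
echo '"N","3.5",3,"Bob","10/29/17"' |
awk 'BEGIN{FS=OFS=","} {k=$4; $4=$5; sub(/,[^,]*$/,""); print k": " $0}'
# -> "Bob": "N","3.5",3,"10/29/17"
```

The regex /,[^,]*$/ can only match starting at the last comma, since [^,]* cannot cross a comma, so exactly one trailing field is removed.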
Could you please try the following too; reading the Input_file 2 times, it will provide output in the same sequence in which the 4th column appears in the Input_file.
awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  a[$4]=a[$4]?a[$4] OFS $1 OFS $2 OFS $3 OFS $5:$4 OFS $1 OFS $2 OFS $3 OFS $5
  next
}
a[$4]{
  print a[$4]
  delete a[$4]
}
' Input_file Input_file
If there is any chance that any of the CSV values has a comma, then a "CSV-aware" tool would be advisable to obtain a reliable but straightforward solution.
One approach would be to use one of the many readily available csv2tsv command-line tools. A variety of elegant solutions then becomes possible. For example, one could pipe the CSV through csv2tsv, then awk, then tsv2csv.
Here is another solution that uses csv2tsv and jq:
csv2tsv < input.csv | jq -Rrn '
[inputs | split("\t")]
| group_by(.[3])[]
| sort_by(.[2])
| [.[0][3]] + ( map( del(.[3])) | add)
| #csv
'
This produces:
"Bob","N","3.5","3","10/29/17 ","Y","4.5","5","10/11/18 ","Y","5","6","10/28/18 "
"Jim","Y","3","1"," ","N","4","2","09/29/17 "
"Joe","N","2.5","4","01/26/18"
Trimming the excess spaces is left as an exercise :-)