Print Lines That Are Greater Than Two Fields - linux

I'm okay with grep, but I know that awk is probably way more efficient in this case. I'm learning but not quite there yet.
I have some data:
record1,14.2,10,50
record2,10.7,5,-
record3,9.3,6.8,10
record4,8,2.7,10
record5,5.5,22.4,10
record6,3,23.6,55
record7,2.7,14.6,-
I would like to print only the lines where field 3 is greater than 7 and field 4 is greater than 10 (excluding any dashes). Thus, the output would be this:
record1,14.2,10,50
record6,3,23.6,55
I have played around with awk '{print $3 > 7}'; however, like I said, I'm not great with awk and conditions. I could do it with grep, but I feel like that's inefficient. Any help is greatly appreciated.

The structure of an awk script is condition { action }. The default action is { print }, which prints the whole record.
Your conditions are $3 > 7 and $4 > 10.
Your field separator is a comma.
Combining those things we get:
awk -F, '$3 > 7 && $4 > 10' file
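If you would rather not depend on how awk happens to compare the "-" placeholder with a number, the dash can be excluded explicitly. The following is a minimal sketch using the sample data from the question; the data and result shell variables are only there to make the example self-contained.

```shell
# Sample data copied from the question.
data='record1,14.2,10,50
record2,10.7,5,-
record3,9.3,6.8,10
record4,8,2.7,10
record5,5.5,22.4,10
record6,3,23.6,55
record7,2.7,14.6,-'

# Skip rows whose 4th field is the "-" placeholder, then compare numerically.
# Adding 0 forces awk to treat both fields as numbers.
result=$(printf '%s\n' "$data" |
  awk -F, '$4 != "-" && $3 + 0 > 7 && $4 + 0 > 10')

printf '%s\n' "$result"
```

On the sample data this selects the same two records (record1 and record6) as the shorter command.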

Related

Filtering a file with values over 0.70 using AWK

I have a file of targets predicted by Diana and I would like to extract those with values over 0.70
>AAGACAACGUUUAAACCA|ENST00000367816|0.999999999975474
UTR3 693-701 0.00499294596715397
UTR3 1045-1053 0.405016433077734
>AAGACAACGUUUAAACCA|ENST00000392971|0.996695852735028
CDS 87-95 0.0112208345874892
I don't know why this script doesn't want to work if it seems to be correct
for file in SC*
do
grep ">" $file | awk 'BEGIN{FS="|"}{if($3 >= 0.70)}{print $2, $3}' > 70/$file.tab
done
The issue is it doesn't filter, can you help me to find out the error?
For a start, that's not a valid awk script since you have a misplaced } character:
BEGIN{FS="|"}{if($3 >= 0.70)}{print $2, $3}
#                           |
#                           +-------------+
#                              move here  |
#                                         V
BEGIN{FS="|"}{if($3 >= 0.70){print $2, $3}}
You also don't need grep because awk can do that itself, and you can also set the field separator without a BEGIN block. For example, here's a command that will output field 3 values greater than 0.997, on lines starting with > (using | as a field separator):
pax> awk -F\| '/^>/ && $3 > 0.997 { print $3 }' prog.in
0.999999999975474
I chose 0.997 to ensure one of the lines in your input file was filtered out for being too low (as proof that it works). For your desired behaviour, the command would be:
pax> awk -F\| '/^>/ && $3 > 0.7 { print $2, $3 }' prog.in
ENST00000367816 0.999999999975474
ENST00000392971 0.996695852735028
Keep in mind I've used > 0.7 as per your "values over 0.70" in the heading and text of your question. If you really mean "values 0.70 and above" as per the code in your question, simply change > into >=.
It looks like you are running a for loop that starts awk once per file, so a new awk process is launched for every file. You don't need to do that: awk can read all the matching files itself. So, apart from fixing the typo in your awk program, pass all the files to a single awk invocation, like:
awk -F\| 'FNR==1{close(out); out="70/"FILENAME".tab"} /^>/ && $3 > 0.7 { print $2,$3 > out }' SC*
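To see how the single-invocation form behaves, here is a small sketch that runs it in a scratch directory. The file names SC1 and SC2 and the 0.5 score in SC2 are made up for illustration; only the record layout comes from the question.

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir 70

# Two hypothetical input files in the format shown in the question.
printf '%s\n' '>AAGACAACGUUUAAACCA|ENST00000367816|0.999999999975474' \
              'UTR3 693-701 0.00499294596715397' > SC1
printf '%s\n' '>AAGACAACGUUUAAACCA|ENST00000392971|0.5' \
              'CDS 87-95 0.0112208345874892' > SC2

# One awk process handles every file; the output name is switched
# whenever a new input file starts (FNR == 1).
awk -F'|' 'FNR==1{close(out); out="70/"FILENAME".tab"}
           /^>/ && $3 > 0.7 { print $2, $3 > out }' SC*

result=$(cat 70/SC1.tab)
printf '%s\n' "$result"
```

Note that an output file is only created when at least one line of the corresponding input passes the filter, so in this sketch no 70/SC2.tab is written.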
I think it may be safer to filter with a regex in string mode instead of comparing numerically:
$3 !~/0[.][0-6]/
If awk interprets the input as a number and does a numeric comparison, the result is subject to the rounding errors of floating-point math. With a string-based filter, you avoid values above
~ 0.699 999 999 999 999 95559107901… (the approximate IEEE 754 double-precision value of 7E-1)
being rounded up.

Splitting file based on first column's first character and length

I want to split a .txt into two, with one file having all lines where the first column's first character is "A" and the total of characters in the first column is 6, while the other file has all the rest. Searching led me to find the awk command and ways to separate files based on the first character, but I couldn't find any way to separate it based on column length.
I'm not familiar with awk, so what I tried (to no avail) was awk -F '|' '$1 == "A*****" {print > ("BeginsWithA.txt"); next} {print > ("Rest.txt")}' FileToSplit.txt.
Any help or pointers to the right direction would be very appreciated.
EDIT: As RavinderSingh13 reminded, it would be best for me to put some samples/examples of input and expected output.
So, here's an input example:
#FileToSplit.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
A35646|Line 3|Stuff 3
641|Line 4|Stuff 4
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
413|Line 7|Stuff 7
What the expected output is:
#BeginsWith6.txt#
A35646|Line 3|Stuff 3
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
#Rest.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
641|Line 4|Stuff 4
413|Line 7|Stuff 7
What you want to do is use a regex and length function. You don't show your input, so I will leave it to you to set the field separator. Given your description, you could do:
awk '/^A/ && length($1) == 6 { print > "file_a.txt"; next } { print > "file_b.txt" }' file
This reads the records in file and, if the first field begins with "A" and is 6 characters long, writes the record to file_a.txt; otherwise the record is written to file_b.txt (adjust the names as needed).
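With the sample input from the edited question, the field separator has to be set to | for length($1) to measure only the first column. A minimal sketch, run in a scratch directory (the output names file_a.txt and file_b.txt are the placeholders used above):

```shell
# Work in a throwaway directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample input copied from the question.
cat > FileToSplit.txt <<'EOF'
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
A35646|Line 3|Stuff 3
641|Line 4|Stuff 4
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
413|Line 7|Stuff 7
EOF

# -F'|' makes $1 the first column, so length($1) counts only that column.
awk -F'|' '$1 ~ /^A/ && length($1) == 6 { print > "file_a.txt"; next }
           { print > "file_b.txt" }' FileToSplit.txt

a_lines=$(cat file_a.txt)
b_count=$(awk 'END { print NR }' file_b.txt)
printf '%s\n' "$a_lines"
```

Here file_a.txt receives the three records whose first column starts with A, and the remaining four go to file_b.txt.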
A non-regex awk solution:
awk -F'|' '{print $0>(index($1,"A")==1 && length($1)==6 ? "file_a.txt" : "file_b.txt")}' file
With your shown samples, please try the following. Since your shown samples do NOT start with A, I have not added that logic here; this solution also makes sure the 1st field consists of exactly 6 digits, as per the shown samples.
awk -F'|' '$1~/^[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
2nd solution: In case your 1st field starts with A followed by 5 digits (which you state, but which is not in your shown samples), then try the following.
awk -F'|' '$1~/^A[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
OR (a better version of the above):
awk -F'|' '$1~/^A[0-9]{5}$/{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file

AWK make a simple subtract and find the minimum value of that

I have this matrix:
{{1,4},{6,8}}
and I want to subtract the second value from the first value like: 4-1 and 8-6
and then compare both and show the minimum value of the two, in this case: 8-6=2
All of this using AWK in terminal
You seem a little confused about whether you want to subtract the first from the second or the second from the first. Also, about whether your data is in a file or a variable. However, this should get you started...
If we replace any mad braces or commas with spaces:
echo "{{1,4},{6,8}}" | awk '{gsub(/[{},]/," "); print}'
1 4 6 8
Now we can access the fields as $1 through $4 and do what you want:
echo "{{1,4},{6,8}}" | awk '{gsub(/[{},]/," "); x=$2-$1; y=$4-$3; if(x<y)print x; else print y}'
2
As a, maybe more elegant, alternative suggested by #3161993 in the comments, you could set the field separator to be one or more open or close braces or commas, like this:
awk -F '[,{}]+' '{x=$3-$2; y=$5-$4; if(x<y) print x; else print y}' <<< "{{1,4},{6,8}}"
2
And, as #EdMorton kindly pointed out, it can be made a bit more succinct with a ternary operator like this:
awk -F '[,{}]+' '{x=$3-$2; y=$5-$4; print (x<y ? x : y)}' <<< "{{1,4},{6,8}}"
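If the matrix can hold more than two pairs, the same field-separator trick extends to a loop. This is only a sketch: the third pair {2,9} is made up for illustration, and it assumes every row holds exactly two numbers.

```shell
# With -F'[,{}]+' the first and last fields are empty, so the numbers
# sit in $2 .. $(NF-1), two per pair.
result=$(printf '%s\n' '{{1,4},{6,8},{2,9}}' |
  awk -F'[,{}]+' '{
    min = $3 - $2                      # difference of the first pair
    for (i = 4; i + 1 < NF; i += 2) {  # walk the remaining pairs
      d = $(i + 1) - $i
      if (d < min) min = d
    }
    print min
  }')

printf '%s\n' "$result"
```

For this input the differences are 3, 2 and 7, so the command prints 2.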

read different fields and pass on to awk to extract those fields

This has probably been answered somewhere, but what I have found so far does not match my need.
I would like to read different fields from one file (FILE1) and pass them on to an awk script, which can extract those fields from another file (FILE2).
FILE1
1 156202173 156702173
2 26915624 27415624
4 111714419 112214419
so read lines from this file and pass it on to the following script
awk ' BEGIN {FS=OFS="\t"};
{if ($1==$1 && $2>= $2 && $2<= $3 ) {print $0}}' FILE2 > extracted.file
The FILE2 looks like this;
1 156202182 rs7929618
16 8600861 rs7190157
4 111714800 rs12364336
12 3840048 rs4766166
7 20776538 rs35621824
so the awk script print only when there is a match with the first field and the value falls between the 2nd and 3rd field.
Expected output is
1 156202182 rs7929618
4 111714800 rs12364336
Thanks so much in advance for your response.
There should be plenty of similar questions, but writing the script is faster than looking them up.
$ awk 'NR==FNR{lower[$1]=$2; upper[$1]=$3; next}
lower[$1]<$2 && $2<upper[$1]' file1 file2
1 156202182 rs7929618
4 111714800 rs12364336
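The question's own attempt used inclusive bounds (>= and <=). A sketch of that variant follows, run against the sample files in a scratch directory. It assumes, as the command above does, that each key in the first column of FILE1 appears only once, since a later line with the same key would overwrite the earlier range.

```shell
# Work in a throwaway directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample data copied from the question.
printf '%s\n' '1 156202173 156702173' \
              '2 26915624 27415624' \
              '4 111714419 112214419' > FILE1
printf '%s\n' '1 156202182 rs7929618' \
              '16 8600861 rs7190157' \
              '4 111714800 rs12364336' \
              '12 3840048 rs4766166' \
              '7 20776538 rs35621824' > FILE2

# First pass (NR==FNR): remember the range for each key in FILE1.
# Second pass: print FILE2 lines whose key is known and whose position
# lies inside the range, bounds included.
result=$(awk 'NR==FNR { lower[$1] = $2 + 0; upper[$1] = $3 + 0; next }
              ($1 in lower) && $2 >= lower[$1] && $2 <= upper[$1]' FILE1 FILE2)

printf '%s\n' "$result"
```

The ($1 in lower) test skips keys that never appeared in FILE1 without creating empty array entries for them.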

Retaining one member of a pair

Good afternoon to all,
I have a file containing two fields, each representing a member of a pair.
I want to retain one member of each pair and it does not matter which member as these are codes for duplicate samples in a study.
Each pair appears twice in my file, with each member of the pair appearing once in either column.
An example of an input file is:
XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX7 XXX1
dcb3 abc2
XXX4 XXX2
And an example of the desired output would be
XXX1
XXX2
abc2
How might this be accomplished in bash? Thank you.
Here is a combination of GNU awk, cut and sort. Store the script as duplicatePairs.awk:
{ if ( $1 < $2) print $1, $2
  else print $2, $1
}
and run it like this: awk -f duplicatePairs.awk your_file | sort -u | cut -d" " -f1
The if sorts the pairs such that a line with x,y and a line with y,x will be printed the same. Then sort -u can remove the duplicate lines. And the cut selects the first column.
With a slightly larger awk script, we can solve the requirements "awk-only":
{
    smallest = $1;
    if ( $1 > $2) {
        smallest = $2
    }
    if( !(smallest in seen) ) {
        seen [ smallest ] = 1
        print smallest
    }
}
Run it like this: awk -f duplicatePairs.awk your_file
While the answer posted by Lars above works very well, I would like to suggest an alternative, in case someone else stumbles upon this problem.
I had previously used awk '!seen[$2,$1]++ {print $1}' to the same result. I didn't realize it had worked since the number of lines in my file wasn't halved. This turned out to be because of some wrong assumptions I made about my data.
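For the sample pairs shown in the question, one way to treat a pair and its reversed form as the same entry is to record both orderings in the seen array. This is a sketch under that assumption, not the exact command quoted above:

```shell
# Sample pairs copied from the question.
pairs='XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX7 XXX1
dcb3 abc2
XXX4 XXX2'

# A line is printed only if neither "a b" nor "b a" has been seen before;
# both orderings are marked as seen the first time the pair appears.
result=$(printf '%s\n' "$pairs" |
  awk '!seen[$1,$2]++ && !seen[$2,$1]++ { print $1 }')

printf '%s\n' "$result"
```

This keeps the first column of the first occurrence of each pair, so the output follows the input order instead of being sorted.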
