Splitting file based on first column's first character and length - linux

I want to split a .txt file into two: one file with all lines where the first column's first character is "A" and the first column is 6 characters long in total, and the other file with all the rest. Searching led me to the awk command and ways to separate files based on the first character, but I couldn't find any way to separate based on column length.
I'm not familiar with awk, so what I tried (to no avail) was awk -F '|' '$1 == "A*****" {print > ("BeginsWithA.txt"); next} {print > ("Rest.txt")}' FileToSplit.txt.
Any help or pointers to the right direction would be very appreciated.
EDIT: As RavinderSingh13 reminded, it would be best for me to put some samples/examples of input and expected output.
So, here's an input example:
#FileToSplit.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
A35646|Line 3|Stuff 3
641|Line 4|Stuff 4
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
413|Line 7|Stuff 7
What the expected output is:
#BeginsWith6.txt#
A35646|Line 3|Stuff 3
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
#Rest.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
641|Line 4|Stuff 4
413|Line 7|Stuff 7

What you want to do is use a regex and length function. You don't show your input, so I will leave it to you to set the field separator. Given your description, you could do:
awk '/^A/ && length($1) == 6 { print > "file_a.txt"; next } { print > "file_b.txt" }' file
This takes the information in file and, if the first field begins with "A" and is 6 characters in length, writes the record to file_a.txt; otherwise the record is written to file_b.txt (adjust the names as needed).
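As a quick sanity check, here is that one-liner run against the sample data from the question, adding -F'|' since those fields are pipe-delimited (the output file names are the answer's placeholders):

```shell
# Recreate the sample input from the question
cat > FileToSplit.txt <<'EOF'
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
A35646|Line 3|Stuff 3
641|Line 4|Stuff 4
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
413|Line 7|Stuff 7
EOF

# 1st field begins with "A" and is exactly 6 characters -> file_a.txt,
# everything else -> file_b.txt
awk -F'|' '/^A/ && length($1) == 6 { print > "file_a.txt"; next }
           { print > "file_b.txt" }' FileToSplit.txt
```

With this input, file_a.txt receives the three A-lines and file_b.txt the remaining four.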

A non-regex awk solution:
awk -F'|' '{print $0>(index($1,"A")==1 && length($1)==6 ? "file_a.txt" : "file_b.txt")}' file

With your shown samples, could you please try the following. This first solution does not require the 1st field to start with A; it only makes sure the 1st field is exactly 6 digits long.
awk -F'|' '$1~/^[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
2nd solution: In case your 1st field starts with A followed by 5 digits (as in your shown samples), try the following.
awk -F'|' '$1~/^A[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
OR (a better version of the above):
awk -F'|' '$1~/^A[0-9]{5}$/{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file

Replacing a string in the beginning of some rows in two columns with another string in linux

I have a tab separated text file. In column 1 and 2 there are family and individual ids that start with a character followed by number as follow:
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
NA1008 NA1008
NA1009 NA1009
I would like to replace NA with HG in both the columns. I am very new to linux and tried the following code and some others:
awk '{sub("NA","HG",$2)';print}' input file > output file
Any help is highly appreciated.
Converting my comment to an answer now: use gsub instead of sub here, because it will globally substitute NA with HG.
awk 'BEGIN{FS=OFS="\t"} {gsub("NA","HG");print}' inputfile > outputfile
OR use the following in case you have several fields and want to perform the substitution only in the 1st and 2nd fields.
awk 'BEGIN{FS=OFS="\t"} {sub("NA","HG",$1);sub("NA","HG",$2);print}' inputfile > outputfile
Change sub to gsub in the 2nd code in case multiple occurrences of NA need to be changed within a field itself.
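A minimal check of the field-limited version; the third column here is invented for illustration, to show that sub() on $1 and $2 leaves the rest of the record alone:

```shell
# Tab-separated sample; the 3rd field also contains "NA" but should survive
printf 'NA1008\tNA1008\tNA note\n' > inputfile

# Substitute only in the 1st and 2nd fields
awk 'BEGIN{FS=OFS="\t"} {sub("NA","HG",$1);sub("NA","HG",$2);print}' inputfile > outputfile

cat outputfile
```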
The $2 in your call to sub only replaces the first occurrence of NA in the second field.
Note that while sed is more typical for such scenarios:
sed 's/NA/HG/g' inputfile > outputfile
you can still use awk:
awk '{gsub("NA","HG")}1' inputfile > outputfile
Since no target variable is passed to gsub (which performs multiple search-and-replaces), the default $0 is used, i.e. the whole record (the current line), so the code above is equivalent to awk '{gsub("NA","HG",$0)}1' inputfile > outputfile.
The 1 at the end triggers printing the current record, it is a shorter variant of print.
Notice that the /^NA/ regex anchors the match to the beginning of each field:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
HG1008 HG1008
HG1009 HG1009
and save it:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
If you have a tab as separator:
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile

read different fields and pass on to awk to extract those fields

Probably this is answered somewhere, but the things I have explored do not match my need.
I would like to read different fields from one file (FILE1) and pass them on to an awk script, which can extract those fields from another file (FILE2).
FILE1
1 156202173 156702173
2 26915624 27415624
4 111714419 112214419
so read lines from this file and pass them on to the following script
awk ' BEGIN {FS=OFS="\t"};
{if ($1==$1 && $2>= $2 && $2<= $3 ) {print $0}}' FILE2 > extracted.file
The FILE2 looks like this;
1 156202182 rs7929618
16 8600861 rs7190157
4 111714800 rs12364336
12 3840048 rs4766166
7 20776538 rs35621824
so the awk script should print a line only when the first field matches and the value falls between the 2nd and 3rd fields.
Expected output is
1 156202182 rs7929618
4 111714800 rs12364336
Thanks so much in advance for your response.
there should be plenty of similar questions but writing the script is faster than looking up.
$ awk 'NR==FNR{lower[$1]=$2; upper[$1]=$3; next}
lower[$1]<$2 && $2<upper[$1]' file1 file2
1 156202182 rs7929618
4 111714800 rs12364336
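The NR==FNR idiom works because the total record counter NR equals the per-file counter FNR only while the first file is being read, so the lower/upper arrays are filled from file1 and the range test runs only on lines of file2. A sketch with the sample files (note the strict < comparisons; use <= and >= instead if the boundary values themselves should match):

```shell
cat > file1 <<'EOF'
1 156202173 156702173
2 26915624 27415624
4 111714419 112214419
EOF

cat > file2 <<'EOF'
1 156202182 rs7929618
16 8600861 rs7190157
4 111714800 rs12364336
12 3840048 rs4766166
7 20776538 rs35621824
EOF

# Pass 1 (file1): remember the lower/upper range for each id in column 1.
# Pass 2 (file2): print records whose position falls inside that range.
awk 'NR==FNR{lower[$1]=$2; upper[$1]=$3; next}
     lower[$1]<$2 && $2<upper[$1]' file1 file2 > extracted.file

cat extracted.file
```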

Print Lines That Are Greater Than Two Fields

I'm okay with grep, but I know that awk is probably way more efficient in this case. I'm learning but not quite there yet.
I have some data:
record1,14.2,10,50
record2,10.7,5,-
record3,9.3,6.8,10
record4,8,2.7,10
record5,5.5,22.4,10
record6,3,23.6,55
record7,2.7,14.6,-
I would like to print only the lines that are greater than 7 in field 3 and greater than 10 in field 4 (dropping any lines with dashes). Thus, the output would be this:
record1,14.2,10,50
record6,3,23.6,55
I have played around using awk '{print $3 > 7}'; however, like I said, I'm not great with awk and conditions. I could do it with grep but I feel like that's inefficient. Any help is greatly appreciated.
The structure of an awk script is condition { action }. The default action is { print }, which prints the whole record.
Your conditions are $3 > 7 and $4 > 10.
Your field separator is a comma.
Combining those things we get:
awk -F, '$3 > 7 && $4 > 10' file
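Run against the sample data, this prints exactly the two expected records. The rows with a dash fall out naturally: awk compares the non-numeric string "-" with 10 as strings, and "-" sorts before "10", so $4 > 10 is false. (Incidentally, the earlier attempt '{print $3 > 7}' does not compare at all: inside a print statement, > is output redirection to a file named 7.)

```shell
cat > data.csv <<'EOF'
record1,14.2,10,50
record2,10.7,5,-
record3,9.3,6.8,10
record4,8,2.7,10
record5,5.5,22.4,10
record6,3,23.6,55
record7,2.7,14.6,-
EOF

# The default action {print} fires when both conditions hold
awk -F, '$3 > 7 && $4 > 10' data.csv > filtered.txt

cat filtered.txt
```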

counting string length before and after a match, line by line in bash or sed

I have a file 'test' of DNA sequences, each with a header or ID like so:
>new
ATCGGC
>two
ACGGCTGGG
>tre
ACAACGGTAGCTACTATACGGTCGTATTTTTT
I would like to print the length of each contiguous string before and after a match to a given string, e.g. CGG
The output would then look like this:
>new
2 1
>two
1 5
>tre
4 11 11
or could just have the character lengths before and after matches for each line.
2 1
1 5
4 11 11
My first attempts used sed to print the next line after finding '>', then found the byte offset for each grep match of "CGG", which I was going to convert to lengths, but this produced the following:
sed -n '/>/ {n;p}' test | grep -aob "CGG"
2:CGG
8:CGG
21:CGG
35:CGG
Essentially, grep is printing the byte offset for each match, counting up, while I want the byte offset for each line independently (i.e. resetting after each line).
I suppose I need to use sed for the search as well, as it operates line by line, but I'm not sure how to count the byte offset or characters in a given string.
Any help would be much appreciated.
By using your given string as the field separator in awk, it's as easy as iterating through the fields on each line and printing their lengths. (Lines starting with > we just print as they are.)
This gives the desired output for your sample data, though you'll probably want to check edge cases like starts with CGG, ends with CGG, only contains CGG, etc.
$ awk -F CGG '/^>/ {print; next} {for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}' file.txt
>new
2 1
>two
1 5
>tre
4 11 11
awk -F CGG
Invoke awk using "CGG" as the field separator. This parses each line into a set of fields separated by each occurrence (if any) of the string "CGG". The "CGG" separators themselves are not part of any field.
Thus the line ACAACGGTAGCTACTATACGGTCGTATTTTTT is parsed into the three fields: ACAA, TAGCTACTATA, and TCGTATTTTTT, denoted in the awk program by $1, $2, and $3, respectively.
/^>/ {print; next}
This pattern/action tells awk that if the line starts with >, it should print the line and go immediately to the next line of input, without considering any further patterns or actions in the program.
{for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}
If we arrive at this action, we know the line did not start with > (see above). Since there is only an action and no pattern, the action is executed for every line of input that reaches it.
The for loop iterates through all the fields (NF is a special awk variable that contains the number of fields in the current line) and prints their length. By checking if we've arrived at the last field, we know whether to print a newline or just a space.
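To make the edge cases mentioned above concrete: a CGG at the very start or end of a sequence produces an empty field, which is reported as length 0 (a quick probe, not part of the original answer):

```shell
# A sequence that both starts and ends with the match string
printf '>edge\nCGGATCGG\n' > edge.txt

awk -F CGG '/^>/ {print; next} {for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}' edge.txt > lengths.txt

cat lengths.txt
```

Here CGGATCGG splits into the fields "", "AT", "", so the reported lengths are 0 2 0.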

AWK interpretation awk -F'AUTO_INCREMENT=' 'NF==1{print "0";next}{sub(/ .*/,"",$2);print $2}'

I've been going through some simple bash scripts at work that someone else wrote months ago, and I've found this line:
| awk -F'AUTO_INCREMENT=' 'NF==1{print "0";next}{sub(/ .*/,"",$2);print $2}'
Can someone help me to interpret this line in simple words. Thank you!
awk -F'AUTO_INCREMENT=' ' # Set 'AUTO_INCREMENT=' as a field separator
NF==1 { # If the number of fields is one, i.e. the line does not contain 'AUTO_INCREMENT='
print "0"; # print '0'
next # Go to next record i.e. skip following code
}
{
sub(/ .*/,"",$2); # Delete anything after a space in the second field
print $2 # Print the second field
}'
Example
Sample inputs
AUTO_INCREMENT=3
a line without AUTO_INCREMENT
AUTO_INCREMENT=10 20 30 foo bar
Output
3
0
10
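The whole pipeline can be reproduced in one go; the input lines below are made up to cover all three cases (value only, no match, value plus trailing text):

```shell
printf 'AUTO_INCREMENT=3\nno counter on this line\nAUTO_INCREMENT=10 20 30 foo bar\n' |
  awk -F'AUTO_INCREMENT=' 'NF==1{print "0";next}{sub(/ .*/,"",$2);print $2}' > counters.txt

cat counters.txt
```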
