Remove \r\n in awk - linux

I have a simple awk command that converts a date from MM/DD/YYYY to YYYY/MM/DD. However, the file I'm using has \r\n at the end of the lines, and sometimes the date is at the end of the line.
awk '
BEGIN { FS = OFS = "|" }
{
split($27, date, /\//)
$27 = date[3] "/" date[1] "/" date[2]
print $0
}
' file.txt
In this case, if the date is MM/DD/YYYY\r\n then I end up with this in the output:
YYYY
/MM/DD
What is the best way to get around this? Keep in mind, sometimes the input is simply \r\n in which case the output SHOULD be // but instead ends up as
/
/

Given that the \r isn't always at the end of field $27, the simplest approach is to remove the \r from the entire line.
With GNU Awk or Mawk (one of which is typically the default awk on Linux platforms), you can simply define your input record separator, RS, accordingly:
awk -v RS='\r\n' ...
Or, if you want \r\n-terminated output lines too, set the output record separator, ORS, to the same value:
awk 'BEGIN { RS=ORS="\r\n"; ...
Optional reading: an aside for BSD/macOS Awk users:
BSD/macOS awk doesn't support multi-character RS values (in line with the POSIX Awk spec: "If RS contains more than one character, the results are unspecified").
Therefore, a sub call inside the Awk script is necessary to trim the \r instance from the end of each input line:
awk '{ sub("\r$", ""); ...
To also output \r\n-terminated lines, option -v ORS='\r\n' (or ORS="\r\n" inside the script's BEGIN block) will work fine, as with GNU Awk and Mawk.
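As a rough end-to-end sketch of the portable sub approach, the snippet below strips the trailing \r and then rearranges the date. The three-field sample file and the use of $3 are assumptions made to keep the example short; the question's real data has the date in $27.

```shell
# Build a small CRLF-terminated sample (3 fields stand in for the 27 in the question).
tmp=$(mktemp)
printf 'a|b|12/31/2020\r\n' > "$tmp"
printf 'c|d|\r\n' >> "$tmp"

# Strip the trailing \r first, then rearrange the date in the last field.
result=$(awk '
BEGIN { FS = OFS = "|" }
{
    sub(/\r$/, "")                        # portable: drop the carriage return, if any
    split($3, date, "/")                  # an empty field leaves date[] empty
    $3 = date[3] "/" date[1] "/" date[2]
    print
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the \r is removed from the whole record before any field is touched, it no longer matters whether the date is the last field or not.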

If you're on a system where \n by itself is the newline, you should remove the \r from the record. You could do it like:
$ awk '{sub(/\r/,"",$NF); ...}'

Related

How to remove double quotes in a specific column by using sub() in AWK

My sample data is
cat > myfile
"a12","b112122","c12,d12"
a13,887988,c13,d13
a14,b14121,c79,d13
When I try to remove " from column 2 with
awk -F, 'BEGIN { OFS = FS } $2 ~ /"/ { sub(/"/, "", $2) }1' myfile
"a12",b112122","c12,d12"
a13,887988,c13,d13
a14,b14121,c79,d13
It removes only one double quote: instead of b112122 I am getting b112122"
How do I remove all " in the 2nd column?
From the documentation:
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp.[...] Return the number of substitutions made (zero or one).
So sub performs at most one replacement; it does not replace all occurrences.
Instead, use gsub:
Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere.
So you can add a 'g' to your line and it works fine:
awk -F, 'BEGIN { OFS = FS } $2 ~ /"/ { gsub(/"/, "", $2) }1' myfile
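A small side-by-side sketch of the difference, using the first line of the sample data (the variable names once and all are only for illustration):

```shell
line='"a12","b112122","c12,d12"'

# sub() replaces only the first match in $2 ...
once=$(printf '%s\n' "$line" | awk -F, 'BEGIN { OFS = FS } { sub(/"/, "", $2) } 1')

# ... while gsub() replaces every match in $2.
all=$(printf '%s\n' "$line" | awk -F, 'BEGIN { OFS = FS } { gsub(/"/, "", $2) } 1')

printf 'sub:  %s\ngsub: %s\n' "$once" "$all"
```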
When you are dealing with a CSV file and splitting on commas instead of using FPAT, it will break sooner or later.
Here is a GNU awk command that does the job (FPAT is a gawk extension).
awk -v OFS="," -v FPAT="([^,]+)|(\"[^\"]+\")" '{gsub(/"/,"",$2)}1' file
"a12",b112122,"c12,d12"
a13,887988,c13,d13
a14,b14121,c79,d13
It works on any column, including column 3.
Example that removes " from column 3 and at the same time changes the separator to |:
awk -v OFS="|" -v FPAT="([^,]+)|(\"[^\"]+\")" '{gsub(/"/,"",$3);$1=$1}1' file
"a12"|"b112122"|c12,d12
a13|887988|c13|d13
a14|b14121|c79|d13

How to pass column name of a file as variable in awk

I am trying to print the data in particular columns by passing them into an awk command.
I have tried using "-v" to set them as a variable, but awk treats the "$" as part of a string. My delimiter is the special character ^A (Ctrl+V, Ctrl+A).
vi test_file.dat
a^Ab^Ac^Ad^Ae^Af^Ag^Ah^Ai^Aj^Ak^Al^Am^An^Ao^Ap
Working code
awk -F'^A' '{print $2,$5,$7}' test_file.dat
It's Printing
b e g
But if I try
export fields='$2,$5,$7'
export file='test_file.dat'
awk -v sample_file="$test_file.dat" -v columns="$fileds" -F'^A' '{print columns}' sample_file
It's printing
$2 $5 $7
I expect the output as
b e g
And I want to pass the delimiter, columns, file name as a parameter like
export fields='$2,$5,$7'
export file='test_file.dat'
export delimiter='^A'
awk -v sample_file="$test_file.dat" -v columns="$fields" -v file_delimiter="$delimiter" -F'file_delimiter' '{print columns}' sample_file
In awk, the $ symbol is effectively an operator which takes the field numbers as arguments. The field names are expressions, which is why $NF works for denoting the last field: NF is evaluated by the $ operator. So as you can see, we should not include the dollar sign in the field names.
If you're using the environment to pass material to Awk, the right thing to do is to have Awk pick it up from the environment.
The environment can be accessed using the ENVIRON associative array. If a variable called delimiter holds the field separator, we might do something like
BEGIN { FS = ENVIRON["delimiter"] }
in the Awk code. Then we aren't dealing with yet another round of shell parameter interpolation issues.
We can pick up the field numbers similarly. The split function can be used to get them into an array. Refer to this one-liner:
$ fields=1,3,9 awk 'BEGIN { split(ENVIRON["fields"], f, ",") ;
for (i in f)
printf("f[%d] = %d\n", i, f[i]) ; }'
f[1] = 1
f[2] = 3
f[3] = 9
In GNU Awk, the expression length(f) gives the number of elements in the array. Portably, the return value of split gives the same count.
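Putting the two ENVIRON ideas together, a minimal sketch might look like the following. It assumes the field list is passed as plain numbers (2,5,7, without dollar signs) and that the delimiter is a real Control-A byte, generated here with printf '\001'; the sample file is shortened for illustration.

```shell
# Sample file with real Control-A (octal 001) separators.
tmp=$(mktemp)
printf 'a\001b\001c\001d\001e\001f\001g\001h\n' > "$tmp"

# Pass the delimiter and the field numbers through the environment.
result=$(delimiter=$(printf '\001') fields=2,5,7 awk '
BEGIN {
    FS = ENVIRON["delimiter"]                 # field separator from the environment
    n = split(ENVIRON["fields"], f, ",")      # n is the number of requested fields
}
{
    # $(f[i]) applies the $ operator to the number stored in f[i]
    for (i = 1; i <= n; i++)
        printf "%s%s", $(f[i]), (i < n ? OFS : ORS)
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Looping from 1 to n, rather than using for (i in f), keeps the columns in the order they were requested.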
To let awk see the control characters while reading the file, you could pipe it through cat -v (there might be a built-in method, although I'm not aware of one), which renders Control-A as the two visible characters ^A. The key to getting that ^A delimiter recognized is then to escape the caret with a \; otherwise awk, gawk, etc. treat ^ as the regex start-of-line anchor.
export fields='$2,$5,$7'
export test_file='test_file.dat'
export delimiter='\\^A'
awk -F "$delimiter" '{ print '"$fields"' }' < <(cat -v "$test_file")
There's also no need to set awk variables for bash variables that are already set, so you can eliminate essentially all of them.
One thing to note if you did want to set them in awk: columns wouldn't work, because awk variables are usually assigned individually from bash variables, for example -v var1='$2' -v var2='$5' -v var3='$7', so you'd end up with { print var1, var2, var3 } in awk. It's doubtful that a single string can be translated into three variables without additional steps.

gsub in awk with variable

I want to replace the ">" with variable names starting with ">" and ending with ".", but the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string "fasta.".
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk knows the name of the file it is currently operating on through the built-in variable FILENAME; I strip the .fasta extension using gensub and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" followed by the content of stub. After that I print the line.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.
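For completeness, the original loop can also be kept almost as written. The reason $nam was not expanded is that it sits inside single quotes, so the shell never substitutes it; passing it with -v is one common way around that. The sketch below uses a temporary directory and the sample file from the question as assumptions. Note that -v interprets backslash escapes in the value, so it is only safe here because the name contains none.

```shell
# Recreate the sample input in a scratch directory.
dir=$(mktemp -d)
printf '>textofDNA\nATCCCCGGG\n>textofDNA2\nATCCCCGGGTTTT\n' > "$dir/sample01.fasta"

result=$(
    for f in "$dir"/*.fasta; do
        nam=$(basename "$f" .fasta)
        # -v copies the shell value into the awk variable nam
        awk -v nam="$nam" '/^>/ { $0 = ">" nam "." substr($0, 2) } 1' "$f"
    done
)

printf '%s\n' "$result"
rm -rf "$dir"
```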

replace string in a file with a string from within the same file

I have a file like this (tens of variables) :
PLAY="play"
APPS="/opt/play/apps"
LD_FILER="/data/mysql"
DATA_LOG="/data/log"
I need a script that will output the variables into another file like this (with space between them):
PLAY=${PLAY} APPS=${APPS} LD_FILER=${LD_FILER}
Is it possible?
I would say:
$ awk -F= '{printf "%s=${%s} ", $1,$1} END {print ""}' file
PLAY=${PLAY} APPS=${APPS} LD_FILER=${LD_FILER} DATA_LOG=${DATA_LOG}
This loops through the file and prints the content before = in a format var=${var} together with a space. At the end, it prints a new line.
Note this leaves a trailing space at the end of the line. If this matters, we can check how to improve it.
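One possible way to avoid the trailing space is to print the separator before each item instead of after it. In this sketch, sep is a hypothetical helper variable that is empty for the first line only:

```shell
# Recreate the sample input.
tmp=$(mktemp)
printf 'PLAY="play"\nAPPS="/opt/play/apps"\nLD_FILER="/data/mysql"\nDATA_LOG="/data/log"\n' > "$tmp"

# sep is empty on the first record and a single space afterwards.
result=$(awk -F= '{ printf "%s%s=${%s}", sep, $1, $1; sep = " " } END { print "" }' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```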
< input sed -e 's/\(.*\)=.*/\1=${\1}/' | tr \\n \ ; echo
sed 's/\([^=]*\)=.*/\1=${\1}/;H;$!d
x;y/\n/ /;s/.//' YourFile
Your sample output excludes the last line (DATA_LOG); if that is important, skip it:
sed '/DATA_LOG=/!{
s/\([^=]*\)=.*/\1=${\1}/;H
}
$!d
x;y/\n/ /;s/.//' YourFile

AWK remove blank lines

The /./ pattern removes blank lines only for the first rule, { print "a"$0 }. How would I ensure the script removes blank lines for every rule?
awk -F, '/./ { print "a"$0 } NR!=1 { print "b"$0 } { print "c"$0 } END { print "d"$0 }' MyFile
A shorter form of the already proposed answer could be the following:
awk NF file
An awk script is a sequence of condition { action } rules. If the action block is omitted, awk prints the whole record (line) whenever the condition is true (nonzero, or a non-empty string).
The NF variable holds the number of fields in the current line. When the line contains at least one field, NF is positive, which triggers the default action (print the whole line). For an empty line, or a line containing only whitespace under the default field separator, NF is zero, the condition is not met, and awk prints nothing.
Note that you don't even need quotes, because this two-letter awk script contains no spaces or characters that the shell would interpret.
or
awk '!/^$/' file
^$ is the regex for an empty line. The two / delimiters tell awk that the string is a regex, and ! is the standard negation.
Awk command to remove blank lines from a file:
awk 'NF > 0' filename
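The NF-based and regex-based forms are not quite equivalent: with the default field separator, a line containing only spaces has NF equal to zero but does not match ^$. A small sketch of the difference, on a made-up four-line input:

```shell
# Four lines: text, empty, spaces only, text.
tmp=$(mktemp)
printf 'one\n\n   \ntwo\n' > "$tmp"

with_nf=$(awk 'NF' "$tmp")          # drops empty AND whitespace-only lines
with_regex=$(awk '!/^$/' "$tmp")    # drops only truly empty lines

printf 'NF keeps %s lines, !/^$/ keeps %s lines\n' \
    "$(printf '%s\n' "$with_nf" | wc -l)" \
    "$(printf '%s\n' "$with_regex" | wc -l)"
rm -f "$tmp"
```

Which behaviour is wanted depends on whether whitespace-only lines count as blank for the task at hand.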
If you want to ignore all blank lines, put this at the beginning of the script:
/^$/ {next}
Put the following conditions inside the first one, and check them with if statements, like this:
awk -F, '
/./ {
print "a"$0;
if (NR!=1) { print "b"$0 }
print "c"$0
}
END { print "d"$0 }
' MyFile
