Does awk CR LF handling break on cygwin?

On Linux, this runs as expected:
$ echo -e "line1\r\nline2"|awk -v RS="\r\n" '/^line/ {print "awk: "$0}'
awk: line1
awk: line2
But under Windows the \r is dropped (awk treats this as one line):
Windows:
$ echo -e "line1\r\nline2"|awk -v RS="\r\n" '/^line/ {print "awk: "$0}'
awk: line1
line2
Windows GNU Awk 4.0.1
Linux GNU Awk 3.1.8
EDIT from @EdMorton (sorry if this is an unwanted addition, but I think it helps demonstrate the issue):
Consider this RS setting and input (on cygwin):
$ awk 'BEGIN{printf "\"%s\"\n", RS}' | cat -v
"
"
$ echo -e "line1\r\nline2" | cat -v
line1^M
line2
This is Solaris with gawk:
$ echo -e "line1\r\nline2" | awk '1' | cat -v
line1^M
line2
and this is cygwin with gawk:
$ echo -e "line1\r\nline2" | awk '1' | cat -v
line1
line2
RS was just its default newline, so where did the control-M go in cygwin?

I just checked with Arnold Robbins (the maintainer of gawk) and the answer is that it's something done by the C libraries; to stop it happening you should set the awk BINMODE variable to 3:
$ echo -e "line1\r\nline2" | awk '1' | cat -v
line1
line2
$ echo -e "line1\r\nline2" | awk -v BINMODE=3 '1' | cat -v
line1^M
line2
See the man page for more info if interested.
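For reference, the gawk manual defines BINMODE's values as: 1 (or "r") applies binary mode to input, 2 (or "w") to output, 3 (or "rw") to both. Combined with the multi-character RS from the question, the original command should then behave under Cygwin as it does on Linux (on Linux the variable is simply accepted and ignored):

```shell
# BINMODE=1 ("r"): binary input; 2 ("w"): binary output; 3 ("rw"): both.
# With the \r preserved on input, the multi-character RS can match again.
printf 'line1\r\nline2' | awk -v BINMODE=3 -v RS='\r\n' '/^line/ {print "awk: " $0}'
# awk: line1
# awk: line2
```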

It seems like the issue is awk specific under Cygwin.
I tried a few different things and it seems that awk is silently replacing \r\n with \n in the input data.
If we simply ask awk to repeat the text unmodified, it will "sanitize" the carriage returns without asking:
$ echo -e "line1\r\nline2" | od -a
0000000 l i n e 1 cr nl l i n e 2 nl
0000015
$ echo -e "line1\r\nline2" | awk '{ print $0; }' | od -a
0000000 l i n e 1 nl l i n e 2 nl
0000014
It will, however, leave other carriage returns intact:
$ echo -e "Test\rTesting\r\nTester\rTested" | awk '{ print $0; }' | od -a
0000000 T e s t cr T e s t i n g nl T e s
0000020 t e r cr T e s t e d nl
0000033
Using a custom record separator of _ ended up leaving the carriage returns intact:
$ echo -e "Testing\r_Tested" | awk -v RS="_" '{ print $0; }' | od -a
0000000 T e s t i n g cr nl T e s t e d nl
0000020 nl
0000021
The most telling example involves having \r\n in the data, but not as a record separator:
$ echo -e "Testing\r\nTested_Hello_World" | awk -v RS="_" '{ print $0; }' | od -a
0000000 T e s t i n g nl T e s t e d nl H
0000020 e l l o nl W o r l d nl nl
0000034
awk is blindly converting \r\n to \n in the input data even though we didn't ask it to.
This substitution seems to happen before record separation is applied, which explains why RS="\r\n" never matches anything: by the time awk looks for \r\n, it has already been replaced with \n in the input data.
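Given that, a portable workaround (my suggestion, not from the answers above) is to stop depending on the \r at all and strip any trailing carriage return explicitly; this behaves the same whether or not the C library has already eaten the \r:

```shell
# Strip a trailing \r, if any, from every record before processing.
printf 'line1\r\nline2' | awk '{ sub(/\r$/, "") } /^line/ { print "awk: " $0 }'
# awk: line1
# awk: line2
```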

awk add string to each line except last blank line

I have file with blank line at the end. I need to add suffix to each line except last blank line.
I use:
awk '$0=$0"suffix"' | sed 's/^suffix$//'
But maybe it can be done without sed?
UPDATE:
I want to skip all lines which contain only '\n' symbol.
EXAMPLE:
I have file test.tsv:
a\tb\t1\n
\t\t\n
c\td\t2\n
\n
I run cat test.tsv | awk '$0=$0"\t2"' | sed 's/^\t2$//':
a\tb\t1\t2\n
\t\t\t2\n
c\td\t2\t2\n
\n
It sounds like this is what you need:
awk 'NR>1{print prev "suffix"} {prev=$0} END{ if (NR) print prev (prev == "" ? "" : "suffix") }' file
The test for NR in the END is to avoid printing a blank line given an empty input file. It's untested, of course, since you didn't provide any sample input/output in your question.
To treat all empty lines the same:
awk '{print $0 (/./ ? "suffix" : "")}' file
#try:
awk 'NF{print $0 "suffix"}' Input_file
this will skip all blank lines
awk 'NF{$0=$0 "suffix"}1' file
to only skip the last line if blank
awk 'NR>1{print p "suffix"} {p=$0} END{print p (NF?"suffix":"") }' file
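To see the delayed-print idiom at work on concrete data (the sample input here is mine, with "suffix" as a stand-in): the previous line is buffered, so the decision about the last line can wait until END, where NF still holds the field count of the final record:

```shell
# Two data lines followed by one blank line: 'a' and 'b' get the
# suffix, the final blank line is printed without one.
printf 'a\nb\n\n' |
  awk 'NR>1{print p "suffix"} {p=$0} END{print p (NF ? "suffix" : "")}'
# asuffix
# bsuffix
# (the final blank line is printed without a suffix)
```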
If perl is okay:
$ cat ip.txt
a b 1

c d 2
$ perl -lpe '$_ .= "\t 2" if !(eof && /^$/)' ip.txt
a b 1	 2
	 2
c d 2	 2
$ # no blank line for empty file as well
$ printf '' | perl -lpe '$_ .= "\t 2" if !(eof && /^$/)'
$
-l strips newline from input, adds back when line is printed at end of code due to -p option
eof to check end of file
/^$/ blank line
$_ .= "\t 2" append to input line
Try this -
$ cat f ###Blank line only in the end of file
-11.2
hello

$ awk '{print (/./?$0"suffix":"")}' f
-11.2suffix
hellosuffix

$
OR
$ cat f ####blank line in middle and end of file
-11.2

hello

$ awk -v val=$(wc -l < f) '{print (/./ || NR!=val?$0"suffix":"")}' f
-11.2suffix
suffix
hellosuffix

$

Filter lines with the other set of lines in Bash

I have lines on my standard input
$ printf "C\nB\nA\n"
C
B
A
and I want to filter out lines (or substrings or regexps - whatever is easier) that appear on some other standard input:
$ printf "B\nA\n"
B
A
I expect just C when entries get filtered.
I've tried with
$printf "C\nB\nA\n" | grep -v `printf "B\nA\n"`
But then I'm getting
grep: A: No such file or directory
How can I perform filtering of standard input by lines returned by other command?
You can use grep's -f option:
Matching Control
-f FILE, --file=FILE
Obtain patterns from FILE, one per line.
[...]
and use the <(command) syntax for using a command's output as the content to be used:
$ printf "C\nB\nA\nA\nC\nB" | grep -vf <(printf "A\nB\n")
C
C
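One caveat not spelled out above: -f patterns are regexes, and -v removes any input line the pattern matches anywhere within. Adding -F (fixed strings) and -x (whole-line match) makes the filter literal and exact, so the pattern A does not also knock out a line like AB:

```shell
# -F: treat patterns as fixed strings; -x: require whole-line matches.
printf 'C\nB\nA\nAB\n' | grep -vxFf <(printf 'A\nB\n')
# C
# AB
```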
Use awk for finer control over how the lines should match:
$ printf "C\nB\nA\n" | awk 'NR == FNR { a[$0] = 1; next } a[$0]' \
<(printf "A\nB\n") -
By changing the array key and the lookup you define how lines should match, e.g. to print lines
from file1.txt whose first column is in file2.txt:
$ awk 'NR == FNR { a[$0] = 1; next } a[$1]' file2.txt file1.txt
# ^ Print if column 1 from file1.txt
# is in file2.txt
To print lines from file1.txt that are contained in column one of file2.txt:
$ awk 'NR == FNR { a[$1] = 1; next } a[$0]' file2.txt file1.txt
# ^ Print if line from file1.txt match
# ^ Store column one from file2.txt
You can use
printf "C\nB\nA\n" | grep -v C
and subsequently
printf "C\nB\nA\n" | grep -v C | grep -v B
then
printf "C\nB\nA\n" | grep -v A | grep -v B
if you want to do it from a file
assuming file contains
C
B
A
then:
cat file | grep -v B | grep -v A
will print
C
if you want to filter by only piping once
cat file | grep -v 'B\|A'
C

How to remove the last CR char with `cut`

I would like to get a portion of a string using cut. Here is a dummy example:
$ echo "foobar" | cut -c1-3 | hexdump -C
00000000 66 6f 6f 0a |foo.|
00000004
Notice the \n char added at the end.
In that case there is no point in using cut again to remove the last char, as in:
echo "foobar" | cut -c1-3 | rev | cut -c 1- | rev
I will still get this extra and unwanted char and I would like to avoid using an extra command such as:
shasum file | cut -c1-16 | perl -pe chomp
The \n is added by echo. Instead, use printf:
$ echo "foobar" | od -c
0000000 f o o b a r \n
0000007
$ printf "foobar" | od -c
0000000 f o o b a r
0000006
It is funny that cut itself also adds a newline:
$ printf "foobar" | cut -b1-3 | od -c
0000000 f o o \n
0000004
So the solution seems to be applying printf to its output:
$ printf "%s" $(cut -b1-3 <<< "foobar") | od -c
0000000 f o o
0000003
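An alternative that sidesteps the added newline entirely (not from the answer above): head -c copies exactly the requested number of bytes and appends nothing:

```shell
# Take the first 3 bytes; no trailing newline is added.
printf 'foobar' | head -c 3 | od -c
```

The same works for the shasum case: shasum file | head -c 16 yields the first 16 hex characters without a newline.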

Is there a way to find how many spaces before the first item in awk?

Here is the input_file:
D E F
H I Z
What I want to know is the number of white spaces before the first item, such as an (imaginary) variable WHITE_SPACES
awk '{
print WHITE_SPACES $1
}' input_file
will return
4D
8H
Any good tricks?
Try:
awk '{print index($0, $1) - 1 $1}' input_file
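An equivalent using match() (sample input is assumed to look like the question's): RSTART is the position of the first non-blank character, so RSTART - 1 is the number of leading spaces:

```shell
# match() sets RSTART to the position of the first non-space char;
# RSTART - 1 is therefore the count of leading spaces.
printf '    D E F\n        H I Z\n' |
  awk '{ match($0, /[^ ]/); print (RSTART - 1) $1 }'
# 4D
# 8H
```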
You could try this,
$ echo ' D E F' | awk -v FS="[^[:space:]]" '{str=$0;match(str, FS);fs_str=substr(str, RSTART, RLENGTH);print length($1)fs_str}'
4D
I'm not sure whether you want what follows the spaces to be just the first letter or the whole first word; here are both variants:
$ echo ' Dnb E F' | awk -v FS="[^[:space:]]" '{str=$0;match(str, FS);fs_str=substr(str, RSTART, RLENGTH);print length($1)fs_str}'
4D
$ echo ' Dnb E F' | awk -v FS="[^[:space:]]+" '{str=$0;match(str, FS);fs_str=substr(str, RSTART, RLENGTH);print length($1)fs_str}'
4Dnb

Linux file splitting

I am using sed to split a file in two
I have a file that has a custom separator "/-sep-/" and I want to split the file where the separator is
currently I have:
sed -n '1,/-sep-/ {p}' /export/data.temp > /export/data.sql.md5
sed -n '/-sep-/,$ {p}' /export/data.temp > /export/data.sql
but the file 1 contains /-sep-/ at the end and the file two begins with /-sep-/
how can I handle this?
note that on file one I should remove a line break and the /-sep-/, and on file two remove the /-sep-/ and a line break :S
Reverse it: tell it what to not print instead.
sed '/-sep-/Q' /export/data.temp > /export/data.sql.md5
sed '1,/-sep-/d' /export/data.temp > /export/data.sql
(Regarding that line break, I did not understand it. A sample input would probably help.)
By the way, your original code needs only minor addition to do what you want:
sed -n '1,/-sep-/{/-sep-/!p}' /export/data.temp > /export/data.sql.md5
sed -n '/-sep-/,${/-sep-/!p}' /export/data.temp > /export/data.sql
$ cat >testfile
a
a
a
a
/-sep-/
b
b
b
b
and then
$ csplit testfile '/-sep-/' '//'
8
8
8
$ head -n 999 xx*
==> xx00 <==
a
a
a
a
==> xx01 <==
/-sep-/
==> xx02 <==
b
b
b
b
sed -n '/-sep-/q; p' /export/data.temp > /export/data.sql.md5
sed -n '/-sep-/,$ {p}' /export/data.temp | sed '1d' > /export/data.sql
Might be easier to do in one pass with awk:
awk -v out=/export/data.sql.md5 -v f2=/export/data.sql '
/-sep-/ { out=f2; next}
{ print > out }
' /export/data.temp
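A quick demonstration of the one-pass approach (data.temp, part1 and part2 are throwaway names for this sketch):

```shell
# Build a sample input, then split it in one pass at the separator line.
printf 'a\n/-sep-/\nb\n' > data.temp
awk -v out=part1 -v f2=part2 '
  /-sep-/ { out = f2; next }   # switch to the second output file here
  { print > out }
' data.temp
# part1 now contains "a", part2 contains "b"; the separator line is in neither.
```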
