Removing Windows newlines on Linux (sed vs. awk)

I have some delimited files with improperly placed newline characters in the middle of fields (not at line ends), appearing as ^M in Vim. They originate from freebcp (on CentOS 6) exports of an MSSQL database. Dumping the data in hex shows \r\n patterns:
$ xxd test.txt | grep 0d0a
0000190: 3932 3139 322d 3239 3836 0d0a 0d0a 7c43
I can remove them with awk, but am unable to do the same with sed.
This works in awk, removing the line breaks completely:
awk 'gsub(/\r/,""){printf "%s", $0;next}{print}'
But this in sed does not, leaving line feeds in place:
sed -i 's/\r//g'
where this appears to have no effect:
sed -i 's/\r\n//g'
Using ^M in the sed expression (ctrl+v, ctrl+m) also does not seem to work.
For this sort of task, sed is easier to grok, but I am working on learning more about both. Am I using sed improperly, or is there a limitation?

You can use the command line tool dos2unix
dos2unix input
Or use the tr command:
tr -d '\r' <input >output
Actually, you can do the file-format switching in vim:
Method A:
:e ++ff=dos
:w ++ff=unix
:e!
Method B:
:e ++ff=dos
:set ff=unix
:w
EDIT
If you want to delete the \r\n sequences in the file, try these commands in vim:
:e ++ff=unix " <-- make sure open with UNIX format
:%s/\r\n//g " <-- remove all \r\n
:w " <-- save file
Your awk solution works fine. Another two sed solutions:
sed '1h;1!H;$!d;${g;s/\r\n//g}' input
sed ':A;/\r$/{N;bA};s/\r\n//g' input
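A small demonstration of why the per-line substitution in the question has no effect while the joining variant does. This is only a sketch: it assumes GNU sed (for \r and for labels on a single line), and the sample data is made up to resemble the hex dump in the question.

```shell
sample=$(mktemp)
# One record with two stray CR LF pairs in the middle of a field, then a normal record.
printf 'a|92192-2986\r\n\r\n|C\nnext|row\n' > "$sample"

# sed reads one line at a time and strips the trailing LF before running the script,
# so \r\n can never match here and all four lines survive.
per_line=$(sed 's/\r\n//g' "$sample" | wc -l)

# While the pattern space ends in CR, append the next line (N), then delete each CR LF pair.
joined=$(sed ':A;/\r$/{N;bA};s/\r\n//g' "$sample")

echo "lines left by the per-line attempt: $per_line"
printf '%s\n' "$joined"
rm -f "$sample"
```

The first command leaves the line count unchanged; the second prints the repaired record followed by the untouched one.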

I believe some versions of sed will not recognize \r as a character. However, you can use a bash feature to work around that limitation:
echo "$string" | sed $'s/\r//'
Here, you let bash replace '\r' with the actual carriage return character inside the $'...' construct before passing that to sed as its command. (Assuming you use bash; other shells should have a similar construct.)
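For shells without the $'...' construct, one possible workaround (an assumption on my part, not something from the answer above) is to capture a real carriage return with printf and splice it into the sed script. The variable names are hypothetical.

```shell
# Command substitution only strips trailing newlines, so the CR byte survives.
cr=$(printf '\r')

# Double quotes let the shell expand $cr; \$ keeps the end-of-line anchor literal for sed.
cleaned=$(printf 'one\r\ntwo\r\n' | sed "s/$cr\$//")

printf '%s\n' "$cleaned"
# In bash the same script could be written as: sed $'s/\r$//'
```

Because sed receives the actual byte, this does not depend on whether the sed in use understands \r.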

sed -e 's/\r//g' input_file
This works for me. The difference is that I use -e instead of -i.
I have also noticed that sed behaves differently on different platforms.
Mine reports, for sed --version:
This is not GNU sed version 4.0

Another method
awk 1 RS='\r\n' ORS=
set Record Separator to \r\n
set Output Record Separator to empty string
1 is always true, and in the absence of an action block {print} is used
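To see the effect on data shaped like the question's, here is a minimal sketch. It assumes an awk that accepts a multi-character RS (gawk and mawk do; a strictly POSIX awk may only use the first character), and the sample text is invented.

```shell
# Records are split on CR LF, so the plain LF line ends stay inside the records
# and are printed back unchanged; only the CR LF pairs disappear.
result=$(printf 'id|street\r\n\r\n|city\nnext|row\n' | awk 1 RS='\r\n' ORS=)

printf '%s\n' "$result"
```

The stray breaks inside the first record are removed, while the genuine Unix line ends between records are preserved.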

Related

sed script replacing string "\n" in string with newline character

I recently wrote a script that parses a whole bunch of files and increments the version number throughout. The script works fine for all files except one. It uses the following sed command (which was pieced together from various Google searches and very limited sed knowledge) to find a line in a .tex file and increment the version number.
sed -i -r 's/(.*)(VERSION\}\{0.)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' fileName.tex
The issue with the above (which I am unsure how to fix) is that the line it finds to change appears as
\newcommand{\VERSION}{0.123},
and the sed command replaces the "\n" in the line above with the newline character, and thus outputting
ewcommand{\VERSION}{0.124} (with a newline before it).
The desired output would be:
\newcommand{\VERSION}{0.124}
How can I fix this?
Alright so I was not able to get the answer from Cyrus to work because the file was finding about 50 other lines in my tex files it wanted to modify and I wasn't quite sure how to fix the awk statement to find just the specific line I wanted. However, I got it working with the original sed method by making a simple change.
My sed command became two: the first creates a temporary string %TMPSTR%, and the second immediately replaces that temporary string to get the desired output and avoid any newline characters appearing.
sed -i -r 's/(.*)(VERSION\}\{0.)([0-9]+)(.*)/echo "\\%TMPSTR%{\\\2$((\3+1))\4"/ge' fileName.tex
sed -i -r 's/%TMPSTR%/newcommand/g' fileName.tex
So the line in the file goes from
\newcommand{\VERSION}{0.123} --> \%TMPSTR%{\VERSION}{0.124} --> \newcommand{\VERSION}{0.124}
and ends at the desired outcome. A bit ugly I suppose but it does what I need!
Use awk, that won't get confused by data with special characters.
Your problem could be solved by temporarily replacing the backslashes, but I hope this answer will lead you to awk.
For one line:
echo '\newcommand{\VERSION}{0.123},' | tr '\\' '\r' |
sed -r 's/(.*)(VERSION\}\{0.)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' | tr '\r' '\\'
For a file:
tr '\\' '\r' < fileName.tex |
sed -r 's/(.*)(VERSION\}\{0.)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' |
tr '\r' '\\' > fileName.tex.tmp && mv fileName.tex.tmp fileName.tex
When \n is the only problem, you can try
sed -i -r 's/\\n/\r/g;s/(.*)(VERSION\}\{0.)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge;s/\r/\\n/' fileName.tex
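The answer recommends awk without showing it, so here is one way it might look. This is a sketch under assumptions, not the code from the other answer: it only touches lines containing VERSION}{0. followed by digits, and the temporary file stands in for fileName.tex. Since awk never hands the line to a shell, the backslashes in the data are left alone.

```shell
tex=$(mktemp)   # hypothetical stand-in for fileName.tex
printf '%s\n' '\documentclass{article}' '\newcommand{\VERSION}{0.123}' > "$tex"

bumped=$(awk '
  # Bracket expressions keep the braces and the dot literal in any awk.
  match($0, /VERSION[}][{]0[.][0-9]+/) {
    head = substr($0, 1, RSTART + RLENGTH - 1)   # everything up to the last digit
    tail = substr($0, RSTART + RLENGTH)          # the closing brace and the rest
    match(head, /[0-9]+$/)                       # locate the trailing number
    $0 = substr(head, 1, RSTART - 1) (substr(head, RSTART) + 1) tail
  }
  { print }
' "$tex")

printf '%s\n' "$bumped"
rm -f "$tex"
```

To update the real file, the output would be redirected to a temporary file and moved over the original, as in the tr pipeline above.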

Why do SED, GREP or AWK fail to remove blank lines from text files?

I am trying to remove blank lines from large text files. For some reason it seems that neither
sed "/^$/d" file.txt > trimmed.txt
nor
grep -v "^$" file.txt > trimmed.txt
nor
awk /./ file.txt > trimmed.txt
do anything. Any thoughts?
UPDATE
Thanks to the great comments by @fedorqui and @Sebastian Stigler, the problem was quickly identified as DOS/Windows carriage returns (^M$) at the end of each line.
While I appreciate Sebastian's suggestion to reformat the files using dos2unix, I would rather have a solution using the tools generally available in most Linux distributions.
The solution that worked for me was an answer given by @Jeremy Stein to the question "Can't remove empty lines with sed regex":
sed -n '/[!-~]/p' file.txt > trimmed.txt
I just tried the commands with a toy example and they work fine as long as file.txt has Unix newlines. If the file contains Windows newlines, none of the commands can remove the blank lines.
You can use the dos2unix linux tool to convert the newlines in file.txt to unix newlines. If you need the output on a windows system then you can use unix2dos to convert trimmed.txt into a file with windows newlines.
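A toy reproduction of the problem, plus one way to drop the "blank" lines without converting the file. This is only a sketch: the awk pattern is my own suggestion rather than something from the answers, and it leaves the carriage returns on the remaining lines in place.

```shell
sample=$(mktemp)
# Two real lines and two "blank" lines, all with Windows line ends.
printf 'first\r\n\r\nsecond\r\n\r\n' > "$sample"

# A blank line still holds a CR, so ^$ never matches and nothing is deleted.
naive=$(sed '/^$/d' "$sample" | wc -l)

# Treat a line as blank if it is empty or contains only a carriage return.
trimmed=$(awk '!/^\r?$/' "$sample" | wc -l)

echo "lines kept by sed /^\$/d: $naive, by the awk filter: $trimmed"
rm -f "$sample"
```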

How to remove OCTAL character using Linux?

I have a large file that I need to edit in Linux.
the file has data fields enclosed by double quotes ( "" ). But when I open the file using notepad++ I see SOH character between the double quotes (ie. "filed1"SOH"field2"SOHSOH"field3"SOH"field4")
And when I open the same file in vim I see the double quotes followed by ^A character. (ie. "filed1"^A"field2"^A^A"field3"^A"field4")
Then when I execute this command in the command line
cat filename.txt | od -c | more
I see that the character is shown as 001 (ie. "filed1"001"field2"001001"field3"001"field4")
I have tried the following via vim
:s%/\\001//g
I also tried this command
sed -e s/\001//g filename.text > filename_new.txt
sed -e s/\\001//g filename.text > filename_new.txt
I need to remove those characters from that file.
How can I do that?
Your attempts at escaping the SOH character with \001 were close.
GNU sed has an extension to specify a decimal value with \d001 (there are also octal and hexadecimal variants):
$ sed -i -e 's/\d001//g' file.txt
In Vim, the regular expression atom looks slightly different: \%d001; alternatively, you can directly enter the character in the :%s command-line via Ctrl + V followed by 001; cp. :help i_CTRL-V_digit.
Use echo -e to get a literal \001 character into your sed command:
$ sed -i -e "$(echo -e 's/\001//g')" file.txt
(-i is a GNU sed extension to request in-place editing.)
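If the goal is only to delete the byte, tr may be enough. This is a swapped-in alternative rather than part of the answer above; the sample line is taken from the question, and tr understands the octal escape \001 itself.

```shell
# printf turns \001 into the SOH byte; tr -d removes every occurrence of it.
cleaned=$(printf '"filed1"\001"field2"\001\001"field3"\001"field4"\n' | tr -d '\001')

printf '%s\n' "$cleaned"
```

For a real file this would be run as tr -d '\001' < filename.txt > filename_new.txt.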
Just keep it simple with awk instead of fussing with quoting issues:
mawk NF=NF FS='\1' OFS=
"filed1""field2""field3""field4"

sed -e 's/^M/d' not working

A very common problem, but I am unable to work around it with sed.
I have a script file (a batch of commands), say myfile.txt, to be executed at once to create a list. When I run the batch operation, my command line interface clearly shows it is unable to parse the commands because a carriage return (^M) is being added at the end of each line.
I thought sed would be the best way to go about it. I tried:
sed -e 's/^M/d' myfile.txt > myfile1.txt
mv myfile1.txt myfile.txt
It didn't work. I also tried this and it didn't work:
sed -e 's/^M//g' myfile.txt > myfile1.txt
mv myfile1.txt myfile.txt
Then I thought maybe sed is taking it as an M character at the beginning of the line, and hence no result. So I tried:
sed -e 's/\^M//g' myfile.txt > myfile1.txt
mv myfile1.txt myfile.txt
But no change. Is there a basic mistake I am making? Kindly advise, as I am bad at sed.
I found a resolution though which was to open the file in vi editor and in command mode execute this:
:set fileformat=unix
:w
But I want it in sed as well.
^M is not literally ^M. Replace ^M with \r. You can use the same representation for tr; these two commands both remove carriage returns:
tr -d '\r' < input.txt > output.txt
sed -e 's/\r//g' input.txt > output.txt
sed -e 's/^M/d' myfile.txt
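This is not a valid substitution: the s (substitute) command needs three separators, as in s/old/new/. Without the s, /^M/d means: if the first letter of the line is M, delete the line; otherwise print it and move on to the next. In neither form does ^M stand for a carriage return, and /\^M/ matches a literal caret followed by M.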
This may help you.
Late, but here for posterity: sed Delete / Remove ^M Carriage Return (Line Feed / CRLF) on Linux or Unix
The gist, which answers the above question: to get ^M, type CTRL+V followed by CTRL+M, i.e. don't just type the caret symbol and a capital M. That will not work.
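The difference between typing the two characters ^ and M and supplying a real carriage return can be checked by counting bytes. A sketch with an invented two-line file; printf is used here as a scriptable stand-in for pressing CTRL+V CTRL+M.

```shell
sample=$(mktemp)
printf 'cmd one\r\ncmd two\r\n' > "$sample"

# Caret followed by M: ^ anchors at the start of the line, so only a leading letter M
# would be removed and the carriage returns stay.
typed=$(sed 's/^M//g' "$sample" | wc -c)

# A real CR byte spliced into the script removes them.
cr=$(printf '\r')
real=$(sed "s/$cr//g" "$sample" | wc -c)

echo "bytes after caret-M pattern: $typed, after real CR: $real"
rm -f "$sample"
```

The first count equals the original file size; the second is smaller by one byte per line.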

Best tool to remove dos line ends and join line back up again

I have a CSV file into which some ^M DOS line ends have crept, and I want to get rid of them, as well as the 16 spaces and 3 tabs which follow. That is, I have to merge that line with the next one down. Here's an offending record and a good one as a sample of what I mean:
"Mary had a ^M
little lamb", "Nursery Rhyme", 1878
"Mary, Mary quite contrary", "Nursery Rhyme", 1838
I can remove the ^M using sed as you can see, but I cannot work out how to rm the nix line end to join the lines back up.
sed -e "s/^M$ //g" rhymes.csv > rhymes.csv
UPDATE
Then I read "However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file." from:
http://sed.sourceforge.net/sedfaq4.html
So editing my question to ask Which tool I should be using?
With help from How can I replace a newline (\n) using sed?, I made this one:
sed -e ':a;N;$!ba;s/\r\n \t\t\t/=/' -i rhymes.csv
<CR> <LF> <16 spaces> <3 tabs>
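The same join can also be sketched in awk, which avoids loading the whole file into sed's pattern space. This assumes that only the broken lines end in a carriage return (a file that is CRLF throughout would be joined into one line); the temporary file stands in for rhymes.csv.

```shell
csv=$(mktemp)   # hypothetical stand-in for rhymes.csv
printf '"Mary had a \r\n                \t\t\tlittle lamb", "Nursery Rhyme", 1878\n' > "$csv"
printf '"Mary, Mary quite contrary", "Nursery Rhyme", 1838\n' >> "$csv"

fixed=$(awk '
  joined { sub(/^[ \t]+/, ""); joined = 0 }               # drop the indentation after a break
  sub(/\r$/, "") { printf "%s", $0; joined = 1; next }    # CR at end: print without a newline
  { print }
' "$csv")

printf '%s\n' "$fixed"
rm -f "$csv"
```

The output goes to standard output; to replace the original, write it to a new file and move that over rhymes.csv rather than redirecting onto the input.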
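If you just want to delete the CR, you could use:
tr -d '\r' < yourfile > yourfile.tmp && mv yourfile.tmp yourfile
(Writing straight back to yourfile in the same pipeline, for example with tee, can truncate the file before tr has read it. If the input and output files are different: tr -d '\r' < yourfile > output)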
dos2unix file_name
to convert the file in place, or
dos2unix -n old_file new_file
to create a new file.
