Why do Linux tools display the CR character as `^M`? [duplicate]

I'm new to Linux, so sorry if my question sounds dumb.
We know that Linux and Mac OS X use \n (0xa), which is the ASCII line feed (LF) character. MS Windows and Internet protocols such as HTTP use the sequence \r\n (0xd 0xa). If you create a file foo.txt in Windows and then view it in a Linux text editor, you’ll see an annoying ^M at the end of each line, which is how Linux tools display the CR character.
But why do Linux tools display the CR character as ^M? As I understand it, \r (carriage return) moves the cursor to the beginning of the current line, so the sensible way to display it would be that, when you open the file, the cursor sits at the beginning of the line that contains \r. So why is ^M displayed at all?
PS: some people have posted answers on how to remove ^M, but I want to know why ^M is displayed in the first place rather than the cursor moving to the beginning of the line, which is the definition of carriage return.

The ASCII control characters like TAB, CR, NL and others are intended to control the printing position of a teletypewriter-like display device.
A text editor isn't such a device. It is not appropriate for a text editor to treat a CR character literally as meaning "go to the first column"; it would make a confusing gibberish out of the editing experience.
A text editor works by parsing a text file's representation, to create an internal representation which is presented to the user. On Unix-like operating systems, a file is represented by zero or more lines, which are terminated by the ASCII NL character. Any CR characters that occur just look like part of the data, and not part of the line separation.
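To make that concrete, here is a minimal Python sketch of what a reader that only recognizes LF sees in a CRLF file. The sample bytes are made up for illustration; the point is only that the CR stays behind as the last byte of every line.

```python
# Hypothetical contents of a file saved with Windows (CRLF) line endings.
data = b"first line\r\nsecond line\r\n"

# A Unix-style reader treats only LF (0x0A) as the line terminator.
lines = data.split(b"\n")[:-1]  # drop the empty piece after the final LF

for line in lines:
    # Each line still ends in CR (0x0D); to the reader it is ordinary data.
    print(line)  # b'first line\r' then b'second line\r'
```

A tool that then wants to show that trailing byte visibly has to pick some printable stand-in for it, which is where ^M comes from.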
Not all editors behave the same way. For instance, the Vim editor will detect that a file uses CR-LF line endings, and load it properly using that representation. A flag is set for that buffer which indicates that it's a "DOS" file, so that when you save it, the same representation is reproduced.
That said, there is a feature in the actual Linux kernel for representing control characters like CR using the ^M notation. The TTY line discipline for any given TTY device can be configured to print characters in this notation, but only when echoing back the characters received.
Demo:
$ stty echoctl # turn on notational echo of control characters
$ cat # run some non-interactive program with rudimentary line input
^F^F^F^F^F^F
^C
$
Above, the Ctrl-F that I entered was echoed back as ^F. So, in fact there is a "Linux editor" which uses this notation: the rudimentary line editor of the "canonical input mode" line discipline.
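As for why CR is specifically ^M: in caret notation a control code is written as ^ followed by the character whose code is 64 (0x40) higher. CR is 0x0D, and 0x0D + 0x40 = 0x4D, which is the letter M. The Python function below is only a sketch of that convention (the name caret_notation is mine, and it assumes 7-bit ASCII input); it is not the kernel's code.

```python
def caret_notation(byte: int) -> str:
    """Return the caret form of an ASCII control code, e.g. 13 -> '^M'."""
    if byte < 0x20 or byte == 0x7F:
        # Flipping bit 6 maps 0x00-0x1F onto '@'..'_' and 0x7F (DEL) onto '?'.
        return "^" + chr(byte ^ 0x40)
    return chr(byte)  # printable characters are shown unchanged


print(caret_notation(0x0D))  # CR      -> ^M
print(caret_notation(0x06))  # Ctrl-F  -> ^F
print(caret_notation(0x03))  # Ctrl-C  -> ^C
```

This is also why pressing Ctrl-M in a terminal sends the same byte as the Enter key.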

Related

I'd like to know the difference (with examples if possible) between
CR LF (Windows), LF (Unix) and CR (Macintosh) line break types.

CR and LF are control characters, respectively coded 0x0D (13 decimal) and 0x0A (10 decimal).
They are used to mark a line break in a text file. As you indicated, Windows uses two characters, the CR LF sequence; Unix uses only LF, and the old Mac OS (pre-OS X Macintosh) used CR.
An apocryphal historical perspective:
As indicated by Peter, CR = Carriage Return and LF = Line Feed, two expressions that have their roots in old typewriters / TTYs. LF moved the paper up (but kept the horizontal position identical) and CR brought back the "carriage" so that the next character typed would be at the leftmost position on the paper (but on the same line). CR+LF did both, i.e. prepared the machine to type a new line. As time went by, the physical semantics of the codes no longer applied, and as memory and floppy disk space were at a premium, some OS designers decided to use only one of the characters. They just didn't communicate very well with one another ;-)
Most modern text editors and text-oriented applications offer options or settings that let them detect a file's end-of-line convention automatically and display the file accordingly.
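How that detection works differs from editor to editor. The Python sketch below is a simple counting heuristic of my own (the function name and the tie-breaking order are assumptions), not the algorithm of any particular editor.

```python
def detect_eol(data: bytes) -> str:
    """Guess the dominant line-ending convention in a byte string."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf  # CRs that are not part of a CRLF pair
    counts = {"crlf": crlf, "lf": lf, "cr": cr}
    if not any(counts.values()):
        return "none"  # no line breaks at all
    # On a tie, max() keeps the first key in insertion order: crlf, lf, cr.
    return max(counts, key=counts.get)


print(detect_eol(b"a\r\nb\r\n"))  # crlf
print(detect_eol(b"a\nb\n"))      # lf
print(detect_eol(b"a\rb\r"))      # cr
```

A file with mixed endings gets whichever convention is most frequent, which is roughly the situation in which stray ^M characters show up.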

This is a good summary I found:
The Carriage Return (CR) character (0x0D, \r) moves the cursor to the beginning of the line without advancing to the next line. This character is used as a new line character in Commodore and early Macintosh operating systems (Mac OS 9 and earlier).
The Line Feed (LF) character (0x0A, \n) moves the cursor down to the next line without returning to the beginning of the line. This character is used as a new line character in Unix-based systems (Linux, Mac OS X, etc.)
The End of Line (EOL) sequence (0x0D 0x0A, \r\n) is actually two ASCII characters, a combination of the CR and LF characters. It moves the cursor both down to the next line and to the beginning of that line. This sequence is used as the new line marker in most other non-Unix operating systems, including Microsoft Windows, Symbian and others.

It's really just about which bytes are stored in a file. CR is the byte code for carriage return (from the days of typewriters) and LF, similarly, is the byte code for line feed. The names just refer to the bytes that are placed as end-of-line markers.
Way more information, as always, on Wikipedia.

Summarized succinctly:
Carriage Return (Mac pre-OS X): CR, \r, ASCII code 13
Line Feed (Linux, Mac OS X): LF, \n, ASCII code 10
Carriage Return and Line Feed (Windows): CRLF, \r\n, ASCII code 13 and then ASCII code 10
If you see the ASCII codes in a strange format, they are merely the numbers 13 and 10 in a different radix/base, usually base 8 (octal) or base 16 (hexadecimal).
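A quick way to check the different bases is to print them. This short Python sketch (the helper name code_forms is just for illustration) shows both characters in decimal, octal and hexadecimal.

```python
def code_forms(char: str) -> dict:
    """Return the character's code in decimal, octal and hexadecimal."""
    code = ord(char)
    return {
        "decimal": str(code),
        "octal": format(code, "o"),
        "hex": format(code, "x"),
    }


print("CR", code_forms("\r"))  # decimal 13, octal 15, hex d
print("LF", code_forms("\n"))  # decimal 10, octal 12, hex a
```

So \015 and 0x0D are both CR, and \012 and 0x0A are both LF.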

Jeff Atwood has a recent blog post about this: The Great Newline Schism
Here is the essence from Wikipedia:
The sequence CR+LF was in common use on many early computer systems that had adopted teletype machines, typically an ASR33, as a console device, because this sequence was required to position those printers at the start of a new line. On these systems, text was often routinely composed to be compatible with these printers, since the concept of device drivers hiding such hardware details from the application was not yet well developed; applications had to talk directly to the teletype machine and follow its conventions. The separation of the two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in one-character time. That is why the sequence was always sent with the CR first. In fact, it was often necessary to send extra characters (extraneous CRs or NULs, which are ignored) to give the print head time to move to the left margin. Even after teletypes were replaced by computer terminals with higher baud rates, many operating systems still supported automatic sending of these fill characters, for compatibility with cheaper terminals that required multiple character times to scroll the display.

CR - ASCII code 13
LF - ASCII code 10.
Theoretically, CR returns the cursor to the first position (on the left). LF feeds one line, moving the cursor one line down. This is how in the old days you controlled printers and text-mode monitors.
These characters are usually used to mark end of lines in text files.
Different operating systems used different conventions. As you pointed out, Windows uses the CR/LF combination while pre-OS X Macs use just CR and so on.

CR and LF are special characters that help us format our code.
CR (\r) stands for CARRIAGE RETURN. It puts the cursor at the beginning of a line, but it doesn't create a new line. This is the line ending that classic Mac OS (before OS X) used.
LF (\n) stands for LINE FEED. It creates a new line, but it doesn't put the cursor at the beginning of that line; the cursor stays in the column where the previous line ended. This is the line ending that Unix and Linux use.
CRLF (\r\n) creates a new line and also puts the cursor at the beginning of that new line. This is the line ending that Windows uses.
Git stores text with LF line endings when line-ending conversion is enabled, which is the usual setup on Windows (the core.autocrlf setting). So when we use Git on Windows it prints a warning like "CRLF will be replaced by LF" and converts CRLF into LF when committing, so that the code stays compatible across platforms.
NB: Don't worry. Treat this less as a warning and more as a notice.

Systems based on ASCII or a compatible character set use either LF (Line feed, 0x0A, 10 in decimal) or CR (Carriage return, 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, 0x0D 0x0A).
These characters are based on printer commands: the line feed indicated that one line of paper should feed out of the printer, and a carriage return indicated that the printer carriage should return to the beginning of the current line.

The sad state of "record separators" or "line terminators" is a legacy of the dark ages of computing.
Now, we take it for granted that anything we want to represent is in some way structured data and conforms to various abstractions that define lines, files, protocols, messages, markup, whatever.
But once upon a time this wasn't exactly true. Applications built in control characters and device-specific processing. The brain-dead systems that required both CR and LF simply had no abstraction for record separators or line terminators. The CR was necessary in order to get the teletype or video display to return to column one and the LF (today, NL, same code) was necessary to get it to advance to the next line. I guess the idea of doing something other than dumping the raw data to the device was too complex.
Unix and Mac actually specified an abstraction for the line end, imagine that. Sadly, they specified different ones. (Unix, ahem, came first.) And naturally, they used a control code that was already "close" to S.O.P.
Since almost all of our operating software today is a descendant of Unix, Mac, or Microsoft operating software, we are stuck with the line-ending confusion.

NL is derived from EBCDIC NL = 0x15 which would logically compare to CRLF 0x0D 0x0A ASCII... This becomes evident when physically moving data from mainframes to midrange. Colloquially (as only arcane folks use EBCDIC), NL has been equated with either CR or LF or CRLF.

Why does `^M` appear in terminal output when looking at some files?

I'm trying to send a file using curl to an endpoint and save the file to the machine.
Sending it with curl from Linux and saving it on the machine works well,
but doing the same with curl from Windows adds a ^M character to the end of every line.
I'm printing the file before saving it and can't see ^M. Only viewing the file on the remote machine after saving it shows me ^M.
A simple string replacement doesn't seem to work.
Why is ^M being added? How can I prevent this?

Quick Answer: That's a carriage return. Carriage returns are a harmless but mildly irritating artifact of how Windows encodes text files. You can strip them out of your files with dos2unix. You can configure most text editors to use "Unix Line Endings" or "LF Line Endings" to prevent them from appearing in new files that you create on Windows PCs in the future.
Long Answer (with some historical trivia):
In a plain text file, when you create a new line (by pressing enter/return), a "line break" is embedded in the file. On Unix/Linux, this is a single character, '\n', the "line feed". On Windows, this is two sequential characters, '\r\n', the "carriage return" followed by the "line feed".
When physical teletype terminals, which behaved much like typewriters, were still in use, the "line feed" character meant "move the paper up to the next line" and the "carriage return" character meant "slide the carriage all the way over so the typing head is on the far left". From the very beginning, nearly all teletype terminals supported implicit carriage return; i.e., triggering a line feed would automatically trigger a carriage return. The developers working on what later evolved into Windows decided that it would be best to include explicit carriage returns, just in case (for some reason) the teletype does not perform one implicitly. The Unix developers, on the other hand, chose to work with the assumption of implicit carriage return.
The carriage return and line feed are ASCII control characters, which means they do not have a visible representation as standalone printable characters. Instead, they affect the output device itself (in this case, the position of the output cursor).
The "^M" you see is a stand-in representation for the carriage return character, used by programs that don't fully "cook" their output (i.e., don't apply the effects of some ASCII Control Characters). (Other control characters have other representations starting with "^", and the "^" character is also used to represent the "ctrl" keyboard key in some Unix programs like nano.)
You can use dos2unix to convert the line endings from Windows-style to Unix-style.
$ curl https://example.com/file_with_crlf.txt | dos2unix > file.txt
On some distros, this tool is included by default, on others it can be installed via the package manager (e.g., on Ubuntu, sudo apt install dos2unix). There also exists a package, unix2dos, for the inverse.
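If you would rather do the conversion inside your own code (for example on the receiving endpoint), the core of it is a single byte replacement. This Python function is a minimal sketch of the idea, not a reimplementation of dos2unix; the name crlf_to_lf is mine, and it deliberately leaves a lone CR untouched.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Turn every CRLF pair into a single LF; lone CR bytes are kept."""
    return data.replace(b"\r\n", b"\n")


windows_text = b"line one\r\nline two\r\n"
print(crlf_to_lf(windows_text))  # b'line one\nline two\n'
```

Working on bytes avoids a common trap: if the data was already read in text mode, the runtime may have translated the line endings before your replacement ever ran.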
Most "smart" text editors for coding (Sublime, Atom, VS Code, Notepad++, etc.) will happily read and write with either Windows-style or Unix-style line endings (this might require changing some configuration options). Often, the line-endings are auto-detected by scanning the contents of a file, and usually new files are created with the Operating System's native line endings (by default). Even the new version of Notepad supports Unix-style line endings. On the other hand, some Unix tools will produce strange results in the presence of Windows-style line breaks. If your codebase will be used by people on both Unix and Windows operating systems, the nice thing to do is to use Unix-style line endings everywhere.
Git on Windows also has an optional mode that checks out all files with Windows-style line breaks, but checks them back in with Unix-style line breaks.
Side Notes (interesting, but not directly related to your question):
What the carriage return actually does (on a modern virtual terminal, be it Windows or Unix) is move the output cursor to the beginning of the line. If you use the carriage return without a line feed, you can "overwrite" part of a string that has already been printed.
$ printf "dogdog" ; printf "\rcat\n"
catdog
Some Unix programs use this to update part of the last line of output in place, to implement things like a live-updating progress indicator. For example, curl shows download progress this way (on stderr) when the file contents are piped elsewhere.
Also: If you had a tool that interpreted Windows-style line endings as literally as possible, and you fed it a string with Unix-style line endings such as "hello\nworld", you would get output like this:
hello
     world
Fortunately, such implementations are extremely rare and, in general, the vast majority of Windows tools can render Unix-style line-endings identically to Windows-style line endings without any problem.
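Both side notes can be reproduced with a toy model of a terminal that takes the control characters literally: CR resets the column, LF moves down one row and keeps the column. This Python sketch is my own simplification (the render function is hypothetical and ignores every other control character), not how any real terminal emulator is written.

```python
def render(text: str) -> list:
    """Return the screen rows produced by printing text on a literal terminal."""
    rows = [[]]
    row = col = 0
    for ch in text:
        if ch == "\r":
            col = 0  # carriage return: back to column 0, same row
        elif ch == "\n":
            row += 1  # line feed: one row down, column unchanged
            if row == len(rows):
                rows.append([])
        else:
            line = rows[row]
            while len(line) < col:
                line.append(" ")  # pad up to the cursor column
            if col < len(line):
                line[col] = ch  # overwrite what was printed before
            else:
                line.append(ch)
            col += 1
    return ["".join(r) for r in rows]


print(render("dogdog\rcat\n"))  # ['catdog', '']
print(render("hello\nworld"))   # ['hello', '     world']
```

The first call shows the overwrite effect, the second shows the staircase you get when LF is not accompanied by a carriage return.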

Why is vim stripping the carriage return when I copy a line to another file?

I sorted a file a.csv into b.csv.
I noticed that the sizes of the files differed, and after noticing that b.csv was exactly n bytes smaller (where n is the number of lines in a.csv), I immediately suspected that a.csv contained those pesky \r.
The .py script for sorting contained the line line.strip() which removed the carriage returns and then afile.write(line2 + '\n') which wrote newlines but not carriage returns.
Ok. Makes sense.
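For reference, the size difference can be reproduced with a few lines of Python. The CSV rows here are made up, and this only imitates the strip-then-write pattern described above; it is not the original script.

```python
# Hypothetical input: three CSV rows with Windows (CRLF) line endings.
lines_in = [b"b,2\r\n", b"a,1\r\n", b"c,3\r\n"]
size_in = sum(len(line) for line in lines_in)

# strip() removes the trailing CR and LF; only LF is written back.
lines_out = [line.strip() + b"\n" for line in sorted(lines_in)]
size_out = sum(len(line) for line in lines_out)

# One byte is lost per line, so the difference equals the line count.
print(size_in - size_out, len(lines_in))  # 3 3
```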
The strange bit is that when I vim'd a.csv, I didn't see the ^M like I usually do (maybe the reason lies in a configuration file), so I only found out about the \r from opening the file in a hex editor.
The more interesting bit is that I then took a small subset of a.csv (3y) and pasted it into a test file (p).
Sorting the test file resulted in a file of the exact same size as the original.
From running xxd on it, I see that there is no \r in the new test file.
When I yank a line that contains a carriage return and paste it into another file, the pasted line does not contain the carriage return. Why?
I tested this on Windows (Cygwin), and it does appear to copy the \r. But on the Linux machine I'm using, it doesn't.
How come?
Edit:
I tried reproducing the issue on another linux machine, but I couldn't. It appears to be a configuration thing - some file somewhere telling vim to do that.

Vim's model of a loaded file is a sequence of lines, each consisting of a sequence of characters. In this model, newlines aren't themselves characters. So when you're copying lines of text, you're not copying the CRs or LFs. Vim also stores a number of other pieces of information which are used to write the file back out again, principally:
fileformat can be unix, dos or mac. This determines what end-of-line character will be written at the end of each line.
endofline can be on or off. This determines if the last line of the file has an end-of-line character.
bomb can be on or off. This determines if a byte order mark is written at the start of the first line.
fileencoding specifies what character encoding will be used to store the file, such as utf-8.
Normally these are all auto-detected upon loading the file. In particular, fileformat will be auto-detected depending on the settings in the fileformats option, which may be configured differently on different platforms. However, sometimes things can go wrong. The most common problem is that a file might have mixed line endings, and that's when you'll start seeing ^M floating around. In this case, Vim has loaded the file as if it's in unix format: it treated the LFs as the line separators and the CRs as just normal characters. You can see which mode Vim has opened the file in by entering :set fileformat? or just :set ff? for short.

Vim detects the newline style (Windows CR-LF vs. Unix LF) when opening the file (according to the 'fileformats' option), and uses the detected 'fileformat' value for all subsequent saves. So, the newline style is a property of the Vim buffer / opened file. When you yank line(s) from one buffer and paste it into another, the newline style isn't kept; instead, the newline style of the target buffer is used, as this makes much more sense.
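The model described in these answers can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the Buffer class and its detection rule (any CRLF means "dos") are mine and much cruder than what Vim actually does with 'fileformats'.

```python
from dataclasses import dataclass, field

EOL = {"unix": "\n", "dos": "\r\n"}


@dataclass
class Buffer:
    """Lines are stored without line endings; the format is kept separately."""
    lines: list = field(default_factory=list)
    fileformat: str = "unix"

    @classmethod
    def load(cls, text: str) -> "Buffer":
        # Crude stand-in for Vim's detection: any CRLF means "dos".
        ff = "dos" if "\r\n" in text else "unix"
        eol = EOL[ff]
        body = text[: -len(eol)] if text.endswith(eol) else text
        return cls(body.split(eol), ff)

    def dump(self) -> str:
        # The line ending is re-attached only when writing the file out.
        return "".join(line + EOL[self.fileformat] for line in self.lines)


src = Buffer.load("a,1\r\nb,2\r\n")  # detected as dos, lines hold no CR
dst = Buffer.load("x\n")             # detected as unix
dst.lines.append(src.lines[0])       # "yank" from src, "paste" into dst
print(repr(dst.dump()))              # 'x\na,1\n'
```

The pasted line picks up the target buffer's line ending, which is why the \r seems to vanish when copying between files.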

cursor(s) in hex edit with vi/vim

I know that we can hex edit a file with vi/vim, using the command :%!xxd (filter the buffer through the *nix hex dump tool) and :%!xxd -r (convert the hex dump back).
The problem is, if I do some hex-editing in the hex-code area, there is no corresponding cursor displayed in the ascii-code area, and vice-versa.
In contrast, when the file is edited with ghex, there are two cursors: one marks the current edit position, and the other shows the corresponding position in the other panel.
For example, if a text file contains the letter 'f' and I am editing it with ghex, then when I move the cursor to the hex value 0x66 in the left panel, the cursor in the right panel shows that the current character being edited is 'f'.
Does this feature already exist in vi/vim/xxd and I just haven't found it?

Just so we're clear, xxd is not a vim command; it is an external program that translates to/from hex dumps. The command %!xxd means 'run the external program xxd, passing it the contents of this file via stdin, and replace the contents of the file with the result.'
From that, I hope you understand that you are not using some special mode of vim to edit these hex dumps. The hex dump is simply the text you see, and you are editing it as a normal text file.
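To illustrate that a hex dump is nothing more than generated text, here is a small Python sketch that produces output in roughly the same layout as xxd's default (offset, hex byte pairs, printable characters). The hexdump function is my own approximation and is not guaranteed to match xxd byte for byte.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, grouped hex pairs and a printable column."""
    pad = (width // 2) * 5 - 1  # width of a full row of 4-digit hex groups
    out = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        digits = chunk.hex()
        groups = " ".join(digits[i:i + 4] for i in range(0, len(digits), 4))
        # Non-printable bytes (such as CR and LF) are shown as dots.
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        out.append(f"{offset:08x}: {groups:<{pad}}  {text}")
    return "\n".join(out)


print(hexdump(b"hello\r\n"))
```

In such a dump a CRLF line ending shows up as the bytes 0d 0a, with both shown as dots in the text column.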
There may be some extension to vim which provides the functionality you are looking for (I haven't looked very hard), but in answer to your question, there is no built-in functionality to do this.

How to have a carriage return without bringing about a linebreak in VIM?

Is it possible to have a carriage return without bringing about a line break?
For instance, I want to write the following sentences in 2 lines and not 4 (and I do not want to type spaces, of course):
On a ship at sea: a tempestuous noise of thunder and lightning heard.
Enter a Master and a Boatswain
Master : Boatswain!
Boatswain : Here, master: what cheer?
Thanks in advance for your help
Thierry

In a text file, the expected line-end character or character sequence is platform dependent. On Windows, the sequence "carriage return (CR, \r) + line feed (LF, \n)" is used, while Unix systems use newline only (LF \n). Macintoshes traditionally used \r only, but these days on OS X I see them dealing with just about any version. Text editors on any system are often able to support all three versions, and to convert between them.
For VIM, see this article for tips on how to convert/set line end character sequences.
However, I'm not exactly sure what advantage the change would have for you: Whichever sequence or character you use, it is just the marker for the end of the line (so there should be one of these, at the end of the first line and you'd have a 2 line text file in any event). However, if your application expects a certain character, you can either change the application -- many programming languages support some form of "universal" newline -- or change the data.
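As one example of such "universal" newline support, Python can read all three conventions transparently. The sketch below uses an in-memory byte string (made up for illustration) instead of a real file.

```python
import io

# Hypothetical data mixing all three conventions: CRLF, CR and LF.
raw = b"one\r\ntwo\rthree\nfour"

# str.splitlines() treats \r\n, \r and \n all as line boundaries.
print(raw.decode("ascii").splitlines())  # ['one', 'two', 'three', 'four']

# newline=None enables universal newlines: every ending is read as "\n".
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)
text = reader.read()
print(repr(text))  # 'one\ntwo\nthree\nfour'
```

The built-in open() takes the same newline parameter, so a file opened in text mode with the default setting behaves the same way.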

Just in case this is what you're looking for:
:set wrap
:set linebreak
The first tells vim to wrap long lines, and the second tells it to only break lines at word breaks, instead of in the middle of words when it reaches the window size.
