Could someone explain line endings? - vim

The current project I'm working on requires me to follow certain procedures to eliminate whitespace in my code. Apparently this has something to do with line endings, since one requirement explicitly tells me to "end all lines with a Unix line ending (\n)".
I code in VIM from the terminal, and I press enter for a new line to write on. Am I missing something here?
What is the reason for keeping code free of trailing whitespace and for using a specific type of line break?
On a side note, what standard VI/VIM settings do you guys use to adhere to common coding standards?
Sincerely,
Why

Different operating systems have different line break conventions. Unix-like systems prefer \n (LF); Windows prefers \r\n (CR LF); pre-OSX Mac OS used \r (CR). Maintaining one convention across a project is usually a good idea.
As for trailing whitespace, AFAIK it's just sloppy (may indicate "quick and dirty" reformatting). In some environments trailing whitespace might also be significant.
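If it helps to see both requirements in one place, here is a minimal Python sketch that converts any of the three conventions to \n and strips trailing spaces and tabs. The function name clean_text is made up for illustration; it is not part of any tool mentioned here.

```python
def clean_text(text: str) -> str:
    """Convert CRLF and lone CR to LF, then strip trailing spaces/tabs per line.

    clean_text is a hypothetical helper name used only for this sketch.
    """
    # Handle CRLF first so that one CRLF does not turn into two newlines.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip(" \t") for line in unified.split("\n")]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "int x = 1;  \r\nint y = 2;\t\r\n"
    print(repr(clean_text(sample)))  # 'int x = 1;\nint y = 2;\n'
```

Note that this treats every lone \r as a line break, which is only appropriate for plain text files.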

Perhaps not everyone on the project codes in Vim. They might be using a Windows-based IDE, which would insert \r\n for a new line.
They would have to make sure their line endings are correct before committing code, whereas you shouldn't have this problem: on a Unix-like system, Vim writes \n as its line ending by default.

To enforce this in Vim, you can use the fileformat option. Setting it to unix (:set fileformat=unix, or :set ff=unix for short) makes Vim write the file with \n line endings.

Related

Why does `^M` appear in terminal output when looking at some files?

I'm trying to send file using curl to an endpoint and save the file to the machine.
Sending curl from Linux and saving it on the machine works well,
but doing the same curl from Windows is adding ^M character to every end of line.
I'm printing the file before saving it and can't see ^M. Only viewing the file on the remote machine after saving it shows me ^M.
A simple string replacement doesn't seem to work.
Why is ^M being added? How can I prevent this?
Quick Answer: That's a carriage return. They're a harmless but mildly irritating artifact of how Windows encodes text files. You can strip them out of your files with dos2unix. You can configure most text editors to use "Unix Line Endings" or "LF Line Endings" to prevent them from appearing in new files that you create from Windows PCs in the future.
Long Answer (with some historical trivia):
In a plain text file, when you create a new line (by pressing enter/return), a "line break" is embedded in the file. On Unix/Linux, this is a single character, '\n', the "line feed". On Windows, this is two sequential characters, '\r\n', the "carriage return" followed by the "line feed".
When physical teletype terminals, which behaved much like typewriters, were still in use, the "line feed" character meant "move the paper up to the next line" and the "carriage return" character meant "slide the carriage all the way over so the typing head is on the far left". From the very beginning, nearly all teletype terminals supported implicit carriage return; i.e., triggering a line feed would automatically trigger a carriage return. The developers working on what later evolved into Windows decided that it would be best to include explicit carriage returns, just in case (for some reason) the teletype does not perform one implicitly. The Unix developers, on the other hand, chose to work with the assumption of implicit carriage return.
The carriage return and line feed are ASCII control characters, which means they do not have a visible representation as standalone printable characters. Instead, they affect the output cursor itself (in this case, its position).
The "^M" you see is a stand-in representation for the carriage return character, used by programs that don't fully "cook" their output (i.e., don't apply the effects of some ASCII Control Characters). (Other control characters have other representations starting with "^", and the "^" character is also used to represent the "ctrl" keyboard key in some Unix programs like nano.)
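To make the "^M" notation less mysterious: the letter after the caret is the control character's code with bit 0x40 flipped, so carriage return (0x0D) becomes "M" (0x4D). Here is a small Python sketch of that mapping. The function name is invented for this example, it assumes ASCII input, and it leaves line feeds alone so that lines still break.

```python
def caret_notation(data: bytes) -> str:
    """Render ASCII control characters as ^X, leaving line feeds untouched.

    caret_notation is a hypothetical name; the input is assumed to be ASCII.
    """
    out = []
    for b in data:
        if b == 0x0A:
            out.append("\n")  # keep LF as a real line break
        elif b < 0x20:
            out.append("^" + chr(b ^ 0x40))  # 0x0D (CR) -> "^M"
        elif b == 0x7F:
            out.append("^?")  # DEL
        else:
            out.append(chr(b))
    return "".join(out)


if __name__ == "__main__":
    print(caret_notation(b"saved from Windows\r\n"), end="")
    # saved from Windows^M
```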
You can use dos2unix to convert the line endings from Windows-style to Unix-style.
$ curl https://example.com/file_with_crlf.txt | dos2unix > file.txt
On some distros this tool is included by default; on others it can be installed via the package manager (e.g., on Ubuntu, sudo apt install dos2unix). The companion tool unix2dos performs the inverse conversion.
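If you can't install dos2unix on the remote machine, the core of the conversion is easy to sketch in Python. This is only an approximation of what the tool does, and the function names crlf_to_lf and convert_file are made up for the example. The important detail is working in binary mode so that nothing translates the newlines behind your back.

```python
import os
import tempfile


def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CR LF pair with a single LF; lone CRs are left alone."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with Unix line endings."""
    # Binary mode avoids any newline translation by Python itself.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(crlf_to_lf(data))


if __name__ == "__main__":
    # Demo on a throwaway temporary file.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        f.write(b"first\r\nsecond\r\n")
    convert_file(path)
    with open(path, "rb") as f:
        print(f.read())  # b'first\nsecond\n'
    os.remove(path)
```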
Most "smart" text editors for coding (Sublime, Atom, VS Code, Notepad++, etc.) will happily read and write with either Windows-style or Unix-style line endings (this might require changing some configuration options). Often, the line-endings are auto-detected by scanning the contents of a file, and usually new files are created with the Operating System's native line endings (by default). Even the new version of Notepad supports Unix-style line endings. On the other hand, some Unix tools will produce strange results in the presence of Windows-style line breaks. If your codebase will be used by people on both Unix and Windows operating systems, the nice thing to do is to use Unix-style line endings everywhere.
Git on Windows also has an optional mode (the core.autocrlf setting) that checks out all files with Windows-style line breaks, but checks them back in with Unix-style line breaks.
Side Notes (interesting, but not directly related to your question):
What the carriage return actually does (on a modern virtual terminal, be it Windows or Unix) is move the output cursor to the beginning of the line. If you use the carriage return without a line feed, you can "overwrite" part of a string that has already been printed.
$ printf "dogdog" ; printf "\rcat\n"
catdog
Some Unix programs use this to update part of the last line of output in place, to implement things like a live-updating progress indicator. One example is curl, which shows download progress on stderr when the file contents are piped or redirected elsewhere.
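A rough Python sketch of that progress-indicator trick, in case it's useful. The message format and the function name progress_frames are invented for this example: each frame starts with \r so it is drawn over the previous one, and a newline is written only at the end.

```python
import sys
import time


def progress_frames(total: int) -> list:
    """Build one status string per step; the leading CR rewinds to column 0.

    progress_frames and the "downloaded i/total" text are illustrative only.
    """
    return [f"\rdownloaded {i}/{total}" for i in range(1, total + 1)]


if __name__ == "__main__":
    for frame in progress_frames(5):
        sys.stderr.write(frame)  # overwrites the previous frame on a terminal
        sys.stderr.flush()
        time.sleep(0.1)
    sys.stderr.write("\n")  # finish the line once the work is done
```

If a new frame can be shorter than the previous one, pad it with spaces, or leftovers from the longer text will remain visible (the same effect as "catdog" above).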
Also: If you had a tool that interpreted Windows-style line endings as literally as possible, and you fed it a string with Unix-style line endings such as "hello\nworld", you would get output like this:
hello
     world
Fortunately, such implementations are extremely rare and, in general, the vast majority of Windows tools can render Unix-style line-endings identically to Windows-style line endings without any problem.
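For the curious, the "as literal as possible" interpretation is simple enough to simulate. The Python sketch below is a toy model, not how any real terminal is implemented, and the name render is invented for it. CR moves the cursor to column 0, while LF moves down one row and keeps the column.

```python
def render(text: str) -> list:
    """Apply CR and LF literally and return the resulting screen rows.

    render is a hypothetical name for this toy model.
    """
    rows = [[]]
    row, col = 0, 0
    for ch in text:
        if ch == "\r":
            col = 0  # carriage return: back to the left edge, same row
        elif ch == "\n":
            row += 1  # line feed: down one row, column unchanged
            if row == len(rows):
                rows.append([])
        else:
            line = rows[row]
            # Pad with spaces if the cursor sits beyond the end of the row.
            while len(line) < col:
                line.append(" ")
            if col < len(line):
                line[col] = ch  # overwrite what is already there
            else:
                line.append(ch)
            col += 1
    return ["".join(r) for r in rows]


if __name__ == "__main__":
    print(render("dogdog\rcat\n"))    # ['catdog', '']
    print(render("hello\nworld"))     # ['hello', '     world']
    print(render("hello\r\nworld"))   # ['hello', 'world']
```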

Windows/Unix line ending issues?

I have a couple of files that were recently edited on Windows and via cPanel's file editor, and they now show up double spaced (as in an extra CR/LF line between each line). Vim tells me (via :set ff?) that the file format is unix (and I'm working on a Mac). If I show special characters via :set list, all the lines just end in $. I tried setting the format via :e ++ff=mac, which appears to remove all line breaks in the currently edited document, and when I write the file and re-open it, it's back to being double spaced. I also tried searching for and replacing ^M and various \r\n combinations. I know I'm missing something simple, but can someone shed some light on what is going on? Is this even a line ending issue?
It appears to be a line ending issue.
The Vim wiki has this to say on the subject:
http://vim.wikia.com/wiki/File_format#Terminator_after_last_line
I, however, for expediency, when faced with a line ending problem, use BBEdit on my Mac to change them to Unix (I share, on the LAN, my eight Linux boxes with a Macbook Pro so I use a directory in Dropbox to transfer files across. scp will do the same job).
If you don't have a copy of BBEdit lying about, you can download Bare Bones's free TextWrangler, and it'll do the same job. It only works on a Mac, obviously...
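Before converting anything, it can help to check what is actually in the file rather than guessing. Here is a small Python sketch that counts each kind of terminator in the raw bytes. The function name count_line_endings is made up for this example. A file that mixes several kinds, or one that really does contain two terminators in a row, will show up clearly in the counts.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs and lone LFs in raw file contents.

    count_line_endings is a hypothetical helper, not a Vim or BBEdit feature.
    """
    return {
        "CRLF": len(re.findall(rb"\r\n", data)),
        "CR": len(re.findall(rb"\r(?!\n)", data)),   # CR not followed by LF
        "LF": len(re.findall(rb"(?<!\r)\n", data)),  # LF not preceded by CR
    }


if __name__ == "__main__":
    print(count_line_endings(b"one\r\ntwo\nthree\rfour"))
    # {'CRLF': 1, 'CR': 1, 'LF': 1}
```

To use it on a real file, read the file in binary mode first, for example open(path, "rb").read().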

^M in PHP Files

^M is the DOS carriage return that's left at the end of each line when you move a file from a Windows box to a *NIX box. I know how to remove it. I am curious whether there is any reason besides aesthetics to remove it from a PHP script.
The PHP script runs fine with it in. Normally, I would remove it without hesitation, but I don't want to have my name next to each line in an svn blame command (though that's beside the point).
Question: Is there a functional reason why it should be removed, other than aesthetics? It doesn't seem to break anything to keep it in. (Give me a good reason plz)
All in all, it should be fine. Other languages are picky about their line endings; I've seen it cause issues in Perl scripts, for example. But for PHP, I've never seen it matter much.
One occasion where it could conceivably matter is in multi-line strings, where the extra chars would make it through to the output. This might matter if your output is not HTML or XML. But JS shouldn't be particular about extraneous CRs, and HTML and XML will generally treat any whitespace the same as a single space (or in many cases, disregard whitespace altogether). Textareas and <pre> elements and such might end up with extra whitespace in them. That's about the only issue I can think of.

What is the point of using *both* Carriage Returns and Line Feeds?

I'd have thought one was enough. But what's the point of doing CRLF (0x0D 0x0A), when you can simply use LF (0x0A)? Normally, whenever I'm using strings (C++), I do this:
myString = "Test\nThis should be a new line!\nAnother linefeed.";
NOTE: For non-C++ programmers reading this, "\n" is a linefeed (0x0A).
But should I really be doing this:
myString = "Test\r\nThis should be a new line!\r\nAnother carriage return/linefeed pair.";
NOTE: "\r" means carriage return (0x0D).
EDIT: Should this be on Programmers.SE?
Remember that these codes all came from old Teletype machines. These were effectively typewriters: it was necessary both to advance the paper by a line (line-feed), but also to return the print head (on the carriage) to the left side of the paper (carriage-return).
Windows, Unix, and old Mac systems each have a different way of writing new lines in text files (not binary ones). If you're programming under Windows, then in binary mode you will read (and you probably want to write) CRLF endings. Under Unix-like systems it would be just LF.
If you deal with your own data formats... it shouldn't really matter which way you choose. It all really depends only on what you want to do with the string and where did you get it from.
Some systems, like UNIX and OS X, just use a linefeed. DOS used an additional carriage return in order to be compatible with teletype machines, and Windows inherited that convention.
You use both on Windows because that's the custom on Windows. It's that simple. But you only write both for files destined for Windows.
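The text-mode versus binary-mode distinction mentioned above is easy to see in Python, whose open() takes a newline argument controlling the translation. This is a sketch of the general idea rather than of C++ specifically (text-mode streams in C and C++ on Windows do a similar translation), and the helper name write_then_read_raw is made up. The program always writes "\n", and text mode decides what lands on disk.

```python
import os
import tempfile


def write_then_read_raw(text: str, newline: str) -> bytes:
    """Write `text` in text mode with the given newline convention,
    then return the raw bytes that ended up in the file.

    write_then_read_raw is a hypothetical helper for this sketch.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # In text mode, each "\n" in `text` is translated to `newline` on write.
        with open(path, "w", newline=newline) as f:
            f.write(text)
        # Binary mode performs no translation, so this is what is really on disk.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    text = "Test\nThis should be a new line!\n"
    print(write_then_read_raw(text, "\r\n"))  # b'Test\r\nThis should be a new line!\r\n'
    print(write_then_read_raw(text, "\n"))    # b'Test\nThis should be a new line!\n'
```

So in most cases you keep writing "\n" in your strings and only spell out "\r\n" when you are producing bytes directly, for example in binary mode or for a protocol that requires CRLF.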

Gedit adds line at end of file

The answer to this must be somewhere but I'm not finding it -- can anyone help me understand why in Gedit, if I have a page of code there is no extra trailing blank line, but then when I do a file comparison for my svn commit it shows an extra line being added at the end of the file?
I have a feeling that Gedit is automatically adding an ending line break. But why, I have no idea...
Reality finally won and it's been fixed, but the broken behavior is still the default; enable the WYSIWYG behavior in a terminal with
gsettings set org.gnome.gedit.preferences.editor ensure-trailing-newline false
It's a feature. I don't think it can easily be disabled.
this is intentional: text files should always be terminated by \n, otherwise
tools like 'cat', 'sed' etc may have problems. However there is no reason to
always show an empty line at the bottom of the text view, that's why we do not
show the last \n
paolo borelli [gedit developer]
Some editors (I'm unfamiliar with Gedit specifically) will try to ensure that a file always ends with a newline character. Other editors, like perhaps the one that you originally created the file with, will allow you to end a file without a final newline character.
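What such an editor does on save amounts to something like the following Python sketch. The function name is invented here, and this is not gedit's actual code.

```python
def ensure_trailing_newline(text: str) -> str:
    """Append a final "\\n" if non-empty text does not already end with one.

    ensure_trailing_newline is a hypothetical name, not a gedit function.
    """
    if text and not text.endswith("\n"):
        return text + "\n"
    return text


if __name__ == "__main__":
    print(repr(ensure_trailing_newline("last line")))    # 'last line\n'
    print(repr(ensure_trailing_newline("last line\n")))  # 'last line\n'
```

That one added character is what shows up in the svn diff, even though the editor view itself looks unchanged.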
Try the Whitespace Remover plugin.
