How to get current platform end of line character sequence in Rust?

I'm looking for a way to get the platform end of line character sequence (CRLF for Windows, LF for Linux/macOS) at runtime.

I don't believe there is any feature that does this specifically. Even the line-aware features of the standard library don't: BufRead::read_line is documented to only recognize \n, and BufRead::lines, which strips end-of-line characters, only does so for \n and \r\n, regardless of what platform it's invoked on.
"Platform line ending" is really a category error, though. Files are sent across networks and copied from one computer to another. If your program writes files that need to be opened on Windows in Notepad, it doesn't matter whether the program that generates them is running on Windows or Linux; it needs to emit \r\n. Similarly if the program is writing a specific file format or implementing some network protocol; the format or protocol should tell you what line separator to use. If the format allows either and there is no convention, pick the one you prefer; just use it consistently.
If you're reading line endings, you should probably tolerate either one, like BufRead::lines does.
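As a minimal sketch of that tolerant reading, the following reads mixed input through BufRead::lines. The helper name read_lines_tolerant and the sample input are assumptions made for illustration only.

```rust
use std::io::{BufRead, Cursor};

/// Collects lines from any buffered reader, accepting both LF and CRLF.
/// `read_lines_tolerant` is a hypothetical helper name.
fn read_lines_tolerant<R: BufRead>(reader: R) -> std::io::Result<Vec<String>> {
    // BufRead::lines strips a trailing "\n" or "\r\n" from each line.
    reader.lines().collect()
}

fn main() -> std::io::Result<()> {
    // Mixed input: the first line ends in CRLF, the second in LF,
    // and the last has no terminator at all.
    let input = Cursor::new("first\r\nsecond\nthird");
    let lines = read_lines_tolerant(input)?;
    println!("{:?}", lines);
    Ok(())
}
```

Note that only a \r directly before \n is stripped; a lone \r in the middle of a line would be kept as part of the line.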
However, should you really need to, like if your output will be read by a poorly written program that expects different line endings on different platforms, you can use conditional compilation attributes to achieve this effect:
#[cfg(windows)]
const LINE_ENDING: &str = "\r\n";
#[cfg(not(windows))]
const LINE_ENDING: &str = "\n";
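A short sketch of how such a constant might be used follows. The helper join_lines is hypothetical. Keep in mind that cfg is evaluated at compile time against the compilation target, so the value is fixed when the program is built rather than discovered at runtime.

```rust
// Conditional constants as in the answer; exactly one of them is compiled in.
#[cfg(windows)]
const LINE_ENDING: &str = "\r\n";
#[cfg(not(windows))]
const LINE_ENDING: &str = "\n";

/// Terminates each entry of `lines` with the compile-time line ending.
/// `join_lines` is a hypothetical helper, not part of the standard library.
fn join_lines(lines: &[&str]) -> String {
    let mut out = String::new();
    for line in lines {
        out.push_str(line);
        out.push_str(LINE_ENDING);
    }
    out
}

fn main() {
    let text = join_lines(&["alpha", "beta"]);
    // {:?} makes the otherwise invisible terminator visible.
    println!("{:?}", text);
}
```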

Related

Why does `^M` appear in terminal output when looking at some files?

I'm trying to send file using curl to an endpoint and save the file to the machine.
Sending curl from Linux and saving it on the machine works well,
but doing the same curl from Windows is adding ^M character to every end of line.
I'm printing the file before saving it and can't see ^M. Only viewing the file on the remote machine after saving it shows me ^M.
A simple string replacement doesn't seem to work.
Why is ^M being added? How can I prevent this?
Quick Answer: That's a carriage return. They're a harmless but mildly irritating artifact of how Windows encodes text files. You can strip them out of your files with dos2unix. You can configure most text editors to use "Unix Line Endings" or "LF Line Endings" to prevent them from appearing in new files that you create from Windows PCs in the future.
Long Answer (with some historical trivia):
In a plain text file, when you create a new line (by pressing enter/return), a "line break" is embedded in the file. On Unix/Linux, this is a single character, '\n', the "line feed". On Windows, this is two sequential characters, '\r\n', the "carriage return" followed by the "line feed".
When physical teletype terminals, which behaved much like typewriters, were still in use, the "line feed" character meant "move the paper up to the next line" and the "carriage return" character meant "slide the carriage all the way over so the typing head is on the far left". From the very beginning, nearly all teletype terminals supported implicit carriage return; i.e., triggering a line feed would automatically trigger a carriage return. The developers working on what later evolved into Windows decided that it would be best to include explicit carriage returns, just in case (for some reason) the teletype does not perform one implicitly. The Unix developers, on the other hand, chose to work with the assumption of implicit carriage return.
The carriage return and line feed are ASCII control characters, which means they do not have a visible representation as standalone printable characters. Instead, they affect the output cursor itself (in this case, the position of the output cursor).
The "^M" you see is a stand-in representation for the carriage return character, used by programs that don't fully "cook" their output (i.e., don't apply the effects of some ASCII Control Characters). (Other control characters have other representations starting with "^", and the "^" character is also used to represent the "ctrl" keyboard key in some Unix programs like nano.)
You can use dos2unix to convert the line endings from Windows-style to Unix-style.
$ curl https://example.com/file_with_crlf.txt | dos2unix > file.txt
On some distros, this tool is included by default; on others it can be installed via the package manager (e.g., on Ubuntu, sudo apt install dos2unix). There also exists a package, unix2dos, for the inverse.
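If you would rather do the conversion inside a program, the core idea can be sketched in a few lines of Rust. This is not how dos2unix itself is implemented; crlf_to_lf is a hypothetical helper that assumes the whole text fits in memory as valid UTF-8.

```rust
/// Rewrites every CRLF pair as a single LF; a lone CR is left untouched.
/// `crlf_to_lf` is a hypothetical helper, not the dos2unix implementation.
fn crlf_to_lf(input: &str) -> String {
    input.replace("\r\n", "\n")
}

fn main() {
    let windows_text = "line one\r\nline two\r\n";
    let unix_text = crlf_to_lf(windows_text);
    // Debug formatting shows the remaining control characters explicitly.
    println!("{:?}", unix_text);
}
```

Replacing the two-character sequence, rather than deleting every \r, keeps any carriage return that is not part of a line ending.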
Most "smart" text editors for coding (Sublime, Atom, VS Code, Notepad++, etc.) will happily read and write with either Windows-style or Unix-style line endings (this might require changing some configuration options). Often, the line-endings are auto-detected by scanning the contents of a file, and usually new files are created with the Operating System's native line endings (by default). Even the new version of Notepad supports Unix-style line endings. On the other hand, some Unix tools will produce strange results in the presence of Windows-style line breaks. If your codebase will be used by people on both Unix and Windows operating systems, the nice thing to do is to use Unix-style line endings everywhere.
Git on Windows also has an optional mode that checks out all files with Windows-style line breaks, but checks them back in with Unix-style line breaks.
Side Notes (interesting, but not directly related to your question):
What the carriage return actually does (on a modern virtual terminal, be it Windows or Unix) is move the output cursor to the beginning of the line. If you use the carriage return without a line feed, you can "overwrite" part of a string that has already been printed.
$ printf "dogdog" ; printf "\rcat\n"
catdog
Some Unix programs use this to asynchronously update part of the last line of output, to implement things like a live-updating progress indicator. One example is curl, which shows download progress on stderr when the file contents are redirected or piped elsewhere.
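A minimal sketch of such an indicator is shown below. The message text, the step size, the delay, and the choice of stderr as the output stream are all assumptions of this example; it does not reproduce what any particular tool prints.

```rust
use std::io::{self, Write};
use std::thread::sleep;
use std::time::Duration;

/// Builds one progress frame. The leading "\r" moves the cursor back to
/// column 0 so the frame overwrites the previous one.
/// `progress_frame` is a hypothetical helper.
fn progress_frame(percent: u32) -> String {
    format!("\rDownloading... {:>3}%", percent)
}

fn main() -> io::Result<()> {
    let mut err = io::stderr();
    for percent in (0..=100).step_by(25) {
        write!(err, "{}", progress_frame(percent))?;
        // Flush explicitly, since no line feed is written yet.
        err.flush()?;
        sleep(Duration::from_millis(200));
    }
    // Finish with a line feed so later output starts on a fresh line.
    writeln!(err)?;
    Ok(())
}
```

Padding the number to a fixed width matters here: a shorter frame would otherwise leave characters from a longer previous frame on screen.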
Also: If you had a tool that interpreted Windows-style line endings as literally as possible, and you fed it a string with Unix-style line endings such as "hello\nworld", you would get output like this:
hello
     world
Fortunately, such implementations are extremely rare and, in general, the vast majority of Windows tools can render Unix-style line-endings identically to Windows-style line endings without any problem.
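To make the literal interpretation concrete, here is a small sketch of an imaginary terminal in which a carriage return only resets the column and a line feed only moves down one row. The function render_literal is hypothetical and handles nothing beyond these two control characters.

```rust
/// Renders `input` on an imaginary terminal that treats CR and LF literally:
/// CR returns to column 0, LF moves down one row and keeps the column.
/// `render_literal` is a hypothetical helper for illustration only.
fn render_literal(input: &str) -> Vec<String> {
    let mut rows: Vec<Vec<char>> = vec![Vec::new()];
    let (mut row, mut col) = (0usize, 0usize);
    for ch in input.chars() {
        match ch {
            '\r' => col = 0,
            '\n' => {
                row += 1;
                if row == rows.len() {
                    rows.push(Vec::new());
                }
            }
            _ => {
                let line = &mut rows[row];
                // Pad with spaces if the cursor sits past the end of the row.
                if line.len() <= col {
                    line.resize(col + 1, ' ');
                }
                line[col] = ch;
                col += 1;
            }
        }
    }
    rows.into_iter().map(|r| r.into_iter().collect()).collect()
}

fn main() {
    // With a bare LF the second word starts where the first one ended.
    for line in render_literal("hello\nworld") {
        println!("{}", line);
    }
}
```

Feeding the same function "hello\r\nworld" puts both words at the left margin, and "dogdog\rcat" yields the overwritten line from the printf example.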

Difference between the 3 option syntax for commands in bash

On the Linux command line, one can use either of two ways to pass options to commands. We can use the short option format, which is a single dash followed by a single letter, for example -o, or the long option format, which is two consecutive dashes followed by a word, for example --option. But recently I came across some commands which, to my thinking, use a 'hybrid' of both formats: a single dash followed by a word, for example -option. I'm not talking about commands where you can stick multiple short options together, like ls -lisa. I'm talking about options where the word after the single dash is just one option and not multiple short options strung together.
I don't understand why there's a third option, because what I know about the Linux command line is that you can have only a short format or a long format. Where did the third format come from?
It's actually confusing because sometimes you cannot be sure if the third format is really a dash followed by one option or a dash followed by multiple short options.
This is not a bash issue. All programs have their own way of handling options/flags. There are many different styles:
the single-letter style with a single hyphen, for example:
ls -l
the mnemonic-style with double dashes, which seems a preference for GNU-stuff, for example, ls --size
the variable=value-style, for example dd if=file of=otherfile
options without dashes, as in tar cvzf archive.tgz
You could even use a + instead of a - (as in date +%m).
etcetera.
It is important to understand that bash just passes these options to the programs/commands. So, in the programs you will generally see:
int main(int argc, char *argv[]){
(a C code example). In that case, argv[0] will point to the program name (to simplify things a bit) and argv[1] will point to the first argument. How those arguments are interpreted depends on the program.
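The same point can be sketched in Rust, where std::env::args plays the role of argv. The classify function and its labels are hypothetical; they only describe the shape of an argument, since what -option actually means is decided by each program.

```rust
use std::env;

/// Labels an argument by shape alone. Purely illustrative: `classify` and
/// its labels are hypothetical, and real programs apply their own rules.
fn classify(arg: &str) -> &'static str {
    let length = arg.chars().count();
    if arg == "--" {
        "end-of-options marker"
    } else if arg.starts_with("--") {
        "long option"
    } else if arg.starts_with('-') && length == 2 {
        "short option"
    } else if arg.starts_with('-') && length > 2 {
        "ambiguous: bundled short options or single-dash word"
    } else {
        "operand or other style"
    }
}

fn main() {
    // skip(1) drops the program name, the analogue of argv[0] in C.
    for arg in env::args().skip(1) {
        println!("{:<12} {}", arg, classify(&arg));
    }
}
```

Running the program with arguments such as -l --size -lisa -option if=file shows that the shell hands over plain strings, and that -lisa and -option cannot be told apart by shape.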
A quick scan through the built-in commands reveals that the built-ins always seem to use the minus-single letter (-a) for specifying options.
I think you are confusing which component does which part of the parsing.
The command line you type into bash gets parsed twice. First it gets parsed by bash. At this stage, spaces are used to separate the different parameters. Quotes and escapes are being taken into consideration. Wildcards are expanded, and $ variables are substituted.
At the end of this phase, we are left with a command line that has a list of strings, the first of which describes the command to be executed. At this point, bash calls execve, and passes it that list of strings.
The next phase of parsing is optional, and is up to each program to carry out. Most programs call getopt_long, a library function that parses options. The one- and two-dash convention you mention is applied by it (as well as by its older sibling, getopt).
It is, however, up to each program to parse its own parameters. Many programs use getopt_long, which is why you feel, correctly, that it is a standard. Some, however, do not. Those that do not follow their own conventions.
That's just how things are.
For your programs, you should try to use either getopt_long or some compatible solution, as that causes the least amount of confusion for users.

Fortran: odd space-padding string behavior when opening files

I have a Fortran program which reads data from a bunch of input files. The first file contains, among other things, the names of three other files that I will read from, specified in the input file (which I redirect to stdin at execution of the program) as follows
"data/file_1.dat" "data/file2.dat" "data/file_number_3.txt"
They're separated by regular spaces and there are no trailing spaces on the line, just a line break. I read the file names like this:
character*30 fnames(3)
read *, fnames
and then I proceed to read the data, through calling on a function which takes the file name as parameter:
subroutine read_from_data_file(fname)
implicit none
character*(*) fname
open(15,file=fname)
! read some data
end subroutine read_from_data_file
! in the main program:
do i=1,3
call read_from_data_file(trim(fnames(i)))
end do
For the third file, regardless of the order in which I put the file names in the input file, the padding doesn't work and Fortran tries to open a file with a name like "data/file_number_3.txt ", i.e. with a bunch of trailing spaces. This creates an empty file named data/file_number_3.txt (White Space Conflict) in my folder, and as soon as I try to read from the file the program crashes with an EOF error.
I've tried adding trim() in various places, e.g. open(15,file=trim(fname)), without any success. I assume it has something to do with the fixed length of character arrays in Fortran, but I thought trim() would take care of that. Is that assumption incorrect?
How do I troubleshoot and fix this?
Hmmm. I wonder if there is a final character on the last line of your input file which is not whitespace, such as an EOF marker from a Linux system popping up on a Windows system or vice-versa. Try, if you are on a Linux box, dos2unix; on a Windows box try something else (I'm not sure what).
If that doesn't work, try using the intrinsic IACHAR function to examine each individual character in the misbehaving string and examine the entrails.
Like you, I expect trim to trim trailing whitespace from a string, but not all the characters which are not displayed are regarded as whitespace.
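The inspection idea can be sketched as follows, in Rust rather than Fortran. It is not a diagnosis of the problem above: the sample string and the describe_bytes helper are assumptions, chosen to show how a carriage return survives a trim that only removes blanks.

```rust
/// Describes each byte of `s` so that invisible characters, such as a
/// trailing carriage return, stand out.
/// `describe_bytes` is a hypothetical helper for illustration.
fn describe_bytes(s: &str) -> Vec<String> {
    s.bytes()
        .map(|b| match b {
            b'\r' => "0x0D (CR)".to_string(),
            b'\n' => "0x0A (LF)".to_string(),
            b' ' => "0x20 (space)".to_string(),
            _ if b.is_ascii_graphic() => format!("0x{:02X} ('{}')", b, b as char),
            _ => format!("0x{:02X}", b),
        })
        .collect()
}

fn main() {
    // A made-up name followed by a CR and blank padding to a fixed width.
    let raw = "file_3.txt\r   ";
    // Removing only trailing blanks leaves the CR in place.
    let blanks_trimmed = raw.trim_end_matches(' ');
    for entry in describe_bytes(blanks_trimmed) {
        println!("{}", entry);
    }
}
```

The last entry printed is the carriage return, which would be invisible if the string were simply echoed to a terminal. Rust's own trim_end treats \r as whitespace and would remove it, which is why the sketch strips blanks only.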
And, while I'm writing, your use of declarations such as
character*30
is obsolescent; the modern alternative is
character(len=30)
and
character(len=*)
is preferred to
character*(*)
EDIT
Have you tried both reading those names from a file and reading them from stdin ?

Could someone explain line endings?

The current project I'm working on requires me to follow certain procedures to eliminate whitespace in my code. Apparently this has something to do with line endings, since one requirement explicitly tells me to "end all lines with a Unix line ending (\n)".
I code in VIM from the terminal, and I press enter for a new line to write on. Am I missing something here?
What is the reason to keep the code clean from trailing whitespace and using specific types of line breaks?
On a side note, what standard VI/VIM settings do you guys use to adhere to common coding standards?
Sincerely,
Why
Different operating systems have different line break conventions. Unix-like systems prefer \n (LF); Windows prefers \r\n (CR LF); pre-OSX Mac OS used \r (CR). Maintaining one convention across a project is usually a good idea.
As for trailing whitespace, AFAIK it's just sloppy (may indicate "quick and dirty" reformatting). In some environments trailing whitespace might also be significant.
Perhaps not everyone on the project codes in VIM; they might be using a Windows-based IDE, which would insert \r\n for a new line.
They would have to ensure that their line-endings are correct before committing code, whereas you shouldn't have this problem as vim will use \n as its natural line ending.
To enforce this in vim, you can use the fileformat option. Setting it to unix will make your newlines use \n.

What is the point of using *both* Carriage Returns and Line Feeds?

I'd have thought one was enough. But what's the point of doing CRLF (0x0D0A), when you can simply use CR (0D)? Normally, whenever I'm using strings (C++), I do this:
myString = "Test\nThis should be a new line!\nAnother linefeed.";
NOTE: For non-C++ programmers reading this, "\n" is a linefeed (0x0A).
But should I really be doing this:
myString = "Test\r\nThis should be a new line!\r\nAnother carriage return/linefeed pair.";
NOTE: "\r" means carriage return (0x0D).
EDIT: Should this be on Programmers.SE?
Remember that these codes all came from old Teletype machines. These were effectively typewriters: it was necessary both to advance the paper by a line (line-feed), but also to return the print head (on the carriage) to the left side of the paper (carriage-return).
Windows, Unix, and old Mac systems each have a different way of writing new lines in text files (not binary ones). If you're programming under Windows, then in binary mode you will read (and you probably want to write) CRLF endings. Under Unix-like systems it would be just LF.
If you deal with your own data formats, it shouldn't really matter which way you choose. It all depends on what you want to do with the string and where you got it from.
Some systems, like UNIX and OS X, just use a line feed. DOS used an additional carriage return in order to be compatible with teletype machines, and Windows inherited the convention.
You use both on Windows because that's the custom on Windows. It's that simple. But you only write both for files destined for Windows.
