How to find out line-endings in a text file? - linux

I'm trying to use something in bash to show me the line endings in a file printed rather than interpreted. The file is a dump from SSIS/SQL Server being read in by a Linux machine for processing.
Are there any switches within vi, less, more, etc?
In addition to seeing the line-endings, I need to know what type of line end it is (CRLF or LF). How do I find that out?

You can use the file utility to give you an indication of the type of line endings.
Unix:
$ file testfile1.txt
testfile1.txt: ASCII text
"DOS":
$ file testfile2.txt
testfile2.txt: ASCII text, with CRLF line terminators
To convert from "DOS" to Unix:
$ dos2unix testfile2.txt
To convert from Unix to "DOS":
$ unix2dos testfile1.txt
Converting an already-converted file has no effect, so it's safe to run blindly (i.e. without testing the format first), although the usual disclaimers apply, as always.
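As a minimal sketch of the above, you can generate one file of each kind and compare what file reports (the filenames here are just examples):

```shell
# Create a Unix (LF) file and a DOS (CRLF) file
printf 'one\ntwo\n' > unix.txt
printf 'one\r\ntwo\r\n' > dos.txt

# file mentions CRLF terminators only for the DOS file
file unix.txt
file dos.txt
```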

Ubuntu 14.04:
A simple cat -e <filename> works just fine.
This displays Unix line endings (\n or LF) as $ and Windows line endings (\r\n or CRLF) as ^M$.
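To see this in action, here is a throwaway demo (filenames are illustrative):

```shell
# Create one file of each kind
printf 'unix line\n' > lf.txt
printf 'dos line\r\n' > crlf.txt

cat -e lf.txt     # prints: unix line$
cat -e crlf.txt   # prints: dos line^M$
```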

In vi...
:set list to see line-endings.
:set nolist to go back to normal.
While I don't think you can see \n or \r\n in vi, you can see which type of file it is (UNIX, DOS, etc.) to infer which line endings it has...
:set ff
Alternatively, from bash you can use od -t c <filename> or just od -c <filename> to display the returns.
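For example, a quick od -c run on a made-up CRLF file shows the \r and \n bytes explicitly (offsets are octal):

```shell
printf 'hi\r\n' > dos.txt
od -c dos.txt
# The dump will look roughly like:
# 0000000   h   i  \r  \n
# 0000004
```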

In the bash shell, try cat -v <filename>. This should display carriage returns for Windows files.
(This worked for me in rxvt via Cygwin on Windows XP).
Editor's note: cat -v visualizes \r (CR) chars. as ^M. Thus, line-ending \r\n sequences will display as ^M at the end of each output line. cat -e will additionally visualize \n, namely as $. (cat -et will additionally visualize tab chars. as ^I.)

Try file, then file -k, then dos2unix -ih
file will usually be enough. But for tough cases try file -k or dos2unix -ih.
Details below.
Try file -k
Short version: file -k somefile.txt will tell you.
It will output with CRLF line terminators for DOS/Windows line endings.
It will output with CR line terminators for Mac line endings.
It will output just text for Linux/Unix (LF) line endings. (So if it does not explicitly mention any kind of line terminators, that means LF line endings.)
Long version see below.
Real world example: Certificate Encoding
I sometimes have to check this for PEM certificate files.
The trouble with regular file is this: Sometimes it's trying to be too smart/too specific.
Let's try a little quiz: I've got some files. And one of these files has different line endings. Which one?
(By the way: this is what one of my typical "certificate work" directories looks like.)
Let's try regular file:
$ file -- *
0.example.end.cer: PEM certificate
0.example.end.key: PEM RSA private key
1.example.int.cer: PEM certificate
2.example.root.cer: PEM certificate
example.opensslconfig.ini: ASCII text
example.req: PEM certificate request
Huh. It's not telling me the line endings. And I already knew that those were cert files. I didn't need "file" to tell me that.
Some network appliances are really, really picky about how their certificate files are encoded. That's why I need to know.
What else can you try?
You might try dos2unix with the --info switch like this:
$ dos2unix --info -- *
37 0 0 no_bom text 0.example.end.cer
0 27 0 no_bom text 0.example.end.key
0 28 0 no_bom text 1.example.int.cer
0 25 0 no_bom text 2.example.root.cer
0 35 0 no_bom text example.opensslconfig.ini
0 19 0 no_bom text example.req
So that tells you that: yup, "0.example.end.cer" must be the odd man out. But what kind of line endings are there? Do you know the dos2unix output format by heart? (I don't.)
But fortunately there's the --keep-going (or -k for short) option in file:
$ file --keep-going -- *
0.example.end.cer: PEM certificate\012- , ASCII text, with CRLF line terminators\012- data
0.example.end.key: PEM RSA private key\012- , ASCII text\012- data
1.example.int.cer: PEM certificate\012- , ASCII text\012- data
2.example.root.cer: PEM certificate\012- , ASCII text\012- data
example.opensslconfig.ini: ASCII text\012- data
example.req: PEM certificate request\012- , ASCII text\012- data
Excellent! Now we know that our odd file has DOS (CRLF) line endings. (And the other files have Unix (LF) line endings. This is not explicit in this output. It's implicit. It's just the way file expects a "regular" text file to be.)
(If you wanna share my mnemonic: "L" is for "Linux" and for "LF".)
Now let's convert the culprit and try again:
$ dos2unix -- 0.example.end.cer
$ file --keep-going -- *
0.example.end.cer: PEM certificate\012- , ASCII text\012- data
0.example.end.key: PEM RSA private key\012- , ASCII text\012- data
1.example.int.cer: PEM certificate\012- , ASCII text\012- data
2.example.root.cer: PEM certificate\012- , ASCII text\012- data
example.opensslconfig.ini: ASCII text\012- data
example.req: PEM certificate request\012- , ASCII text\012- data
Good. Now all certs have Unix line endings.
Try dos2unix -ih
I didn't know this when I was writing the example above but:
Actually it turns out that dos2unix will give you a header line if you use -ih (short for --info=h) like so:
$ dos2unix -ih -- *
DOS UNIX MAC BOM TXTBIN FILE
0 37 0 no_bom text 0.example.end.cer
0 27 0 no_bom text 0.example.end.key
0 28 0 no_bom text 1.example.int.cer
0 25 0 no_bom text 2.example.root.cer
0 35 0 no_bom text example.opensslconfig.ini
0 19 0 no_bom text example.req
And another "actually" moment: the header format is really easy to remember. Here are two mnemonics:
It's DUMB (left to right: d for Dos, u for Unix, m for Mac, b for BOM).
And also: "DUM" is just the alphabetical ordering of D, U and M.
Further reading
man file
man dos2unix
Wikipedia: Newline

To show CR as ^M in less use less -u or type -u once less is open.
man less says:
-u or --underline-special
Causes backspaces and carriage returns to be treated as print-
able characters; that is, they are sent to the terminal when
they appear in the input.

You can use xxd to show a hex dump of the file, and hunt through it for 0d 0a (CRLF) or lone 0a (LF) byte sequences.
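A minimal sketch, using xxd's plain-hex mode on a made-up file with one CRLF line and one LF line:

```shell
printf 'hi\r\nbye\n' > mixed.txt
# Plain hex: 0d0a marks a CRLF, a lone 0a marks an LF
xxd -p mixed.txt
# prints: 68690d0a6279650a
```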
You can use cat -v <filename> as #warriorpostman suggests.

You may use the command todos filename to convert to DOS endings, and fromdos filename to convert to UNIX line endings. To install the package on Ubuntu, type sudo apt-get install tofrodos.

You can use vim -b filename to edit a file in binary mode, which will show ^M characters for each carriage return; a following new line indicates an LF is also present, i.e. Windows CRLF line endings. By LF I mean \n and by CR I mean \r.
Note that when you use the -b option the file will always be edited in UNIX mode by default, as indicated by [unix] in the status line, meaning that if you add new lines they will end with LF, not CRLF. If you use normal vim without -b on a file with CRLF line endings, you should see [dos] shown in the status line, and inserted lines will have CRLF as the end of line. The vim documentation for the fileformats setting explains the complexities.
Also, I don't have enough points to comment on the Notepad++ answer, but if you use Notepad++ on Windows, use the View / Show Symbol / Show End of Line menu to display CR and LF. In this case LF is shown whereas for vim the LF is indicated by a new line.

I dump my output to a text file. I then open it in notepad ++ then click the show all characters button. Not very elegant but it works.

Vim - always show Windows newlines as ^M
If you prefer to always see the Windows newlines in vim render as ^M, you can add this line to your .vimrc:
set ffs=unix
This will make vim interpret every file you open as a unix file. Since unix files have \n as the newline character, a windows file with a newline character of \r\n will still render properly (thanks to the \n) but will have ^M at the end of the file (which is how vim renders the \r character).
Vim - sometimes show Windows newlines
If you'd prefer just to set it on a per-file basis, you can use :e ++ff=unix when editing a given file.
Vim - always show filetype (unix vs dos)
If you want the bottom line of vim to always display which file format you're editing (and you didn't force-set the fileformat to unix), you can add it to your statusline with
set statusline+=\[%{&fileformat}\]
My full statusline is provided below. Just add it to your .vimrc.
" Make statusline stay, otherwise alerts will hide it
set laststatus=2
set statusline=
set statusline+=%#PmenuSel#
set statusline+=%#LineNr#
" This says 'show filename and parent dir'
set statusline+=%{expand('%:p:h:t')}/%t
" This says 'show filename as would be read from the cwd'
" set statusline+=\ %f
set statusline+=%m\
set statusline+=%=
set statusline+=%#CursorColumn#
set statusline+=\ %y
set statusline+=\ %{&fileencoding?&fileencoding:&encoding}
set statusline+=\[%{&fileformat}\]
set statusline+=\ %p%%
set statusline+=\ %l:%c
set statusline+=\
It'll render like
.vim/vimrc [vim] utf-8[unix] 77% 315:6
at the bottom of your file
Vim - sometimes show filetype (unix vs dos)
If you just want to see what type of file you have, you can use :set fileformat (this will not reflect the original line endings if you've force-set the fileformat). It will return unix for Unix files and dos for Windows files.

Related

Command does not read entire file

I'm having a weird problem: my awk commands don't read the .txt files that I save from Excel. I've tried saving the data in all the .txt formats available in Excel, but when I run a command it does not read the file. Actually it seems to read the first line, but only if that line contains Parcela 1. If I create a plain .txt file in a text editor, though, it reads the file no matter how many lines it has.
Does anyone know what I am doing wrong?
One of my codes:
awk -F"\t" '
{ if ($7 ~ /Parcela 1/)
    print
}' source.txt > output.txt
It is virtually certain that the problem is related to Unix vs Windows vs old-style Mac line-endings. Excel (at least Excel 2008 and 2011 on Mac) can write files in a variety of formats. None of these has 'Unix native' line endings.
For example, using Excel 2011, I got:
$ file *.dif *.csv *.txt *.prn | sort
Data Interchange Format.dif: Non-ISO extended-ASCII text, with CRLF line terminators
MS-DOS Comma Separated.csv: Non-ISO extended-ASCII text, with CR line terminators
MS-DOS Formatted Text.txt: Non-ISO extended-ASCII text, with CR line terminators
Space Delimited Text.prn: Non-ISO extended-ASCII text, with CR line terminators
Tab Delimited Text.txt: Non-ISO extended-ASCII text, with CR line terminators
UTF-16 Unicode Text.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
Windows Comma Separated.csv: ISO-8859 text, with CRLF line terminators
Windows Formatted Text.txt: ISO-8859 text, with CRLF line terminators
$ ule *.dif *.csv *.txt *.prn | sort
Data Interchange Format.dif: 2301 DOS, No final EOL
MS-DOS Comma Separated.csv: 103 Mac, No final EOL
MS-DOS Formatted Text.txt: 103 Mac, No final EOL
Space Delimited Text.prn: 104 Mac
Tab Delimited Text.txt: 103 Mac, No final EOL
UTF-16 Unicode Text.txt: 103 Unix, 103 Mac, No final EOL, 11019 null bytes
Windows Comma Separated.csv: 103 DOS, No final EOL
Windows Formatted Text.txt: 103 DOS, No final EOL
$
The file names correspond to the save format chosen from the Excel drop-down box. The output from file shows that none of the formats are standard Unix text files. The ule (Uniform Line Endings) program is one of my own devising; it was used here in its default 'check' mode. It is interesting that most of the files do not have a final end-of-line sequence; the data stops without a final newline.
$ ule -h
Usage: ule [-bcdhmnosuzV] [file ...]
-b Create backups of the files
-c Check line endings (default)
-d Convert to DOS (CRLF) line endings
-h Print this help and exit
-m Convert to MAC (CR) line endings
-n Ensure line ending at end of file
-o Overwrite original files
-s Write output to standard output (default)
-u Convert to Unix (LF) line endings
-z Check for zero (null) bytes
-V Print version information and exit
$
On Unix systems, lines end with the newline (NL — aka LF or linefeed) character. On Windows, normally lines end with CRLF, carriage return and linefeed; on classic Mac OS (before Mac OS X), and apparently for MS-DOS with the Office products, the lines end with just CR, carriage return.
awk reads lines. If you try to process one of the files with only CR line endings, awk will consider that the file contains a single line. If you try to process one of the files with CRLF line endings, awk will recognize the lines OK (they end at the LF), but will consider the CR to be part of the last field.
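The behaviour described above is easy to reproduce. A minimal sketch, with throwaway filenames, showing how awk counts records in a CR-only file versus a CRLF file:

```shell
# A CR-only ("Mac") file contains no \n, so awk sees one giant record
printf 'a\rb\rc\r' > mac.txt
awk 'END { print NR }' mac.txt     # prints 1

# A CRLF file has the expected record count (records end at the LF),
# but the trailing \r sticks to the last field of each line
printf 'a\r\nb\r\nc\r\n' > dos.txt
awk 'END { print NR }' dos.txt     # prints 3
```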
So, depending on what you're really after, you should be using one of the 'Windows*' formats. The 'Parcela 1' lines are 92, 99 and 102 in those files.
awk -F"\t" '{ if ($7 ~ /Parcela 1/) print; }' "Windows Formatted Text.txt"
9/6/19 (Parcela 1)FINANCIAMENTO FATURA JULHO EM 4X (Dividido) "($1,052.38)"
9/6/19 (Parcela 1)ROUPAS GUI 6.1.1.10 - DESPESAS PESSOAIS:6.1.1.10.004 - VESTUARIO ($44.70)
9/6/19 "(Parcela 1)TROCA 2 PNEUS DIANTEIROS, BALANCEAMENTO E ALINHAMENTO FOX" 6.1.1.02 - TRANSPORTE:6.1.1.02.001 - AUTOMOVEL:6.1.1.02.001 - MANUTENCAO ($282.68)
Any of the other formats is going to give problems in some shape or form, until you massage them into a format that is recognized by awk, e.g. by running:
tr '\r' '\n' < "MS-DOS Comma Separated.csv" > "Unix Comma Separated.csv"
You can then apply awk to the "Unix Comma Separated.csv" file safely.
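As a self-contained sketch of that conversion (the CR-only input here is fabricated to stand in for a Mac-format Excel export):

```shell
# Fake a CR-only ("Mac") export
printf 'x\ry\rz\r' > mac_export.csv

# Translate every CR into an LF
tr '\r' '\n' < mac_export.csv > unix_export.csv

wc -l < unix_export.csv    # three real lines now
```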

How to check if file has special characters such as CR in linux?

How to check if file has special characters such as CR in linux?
I have some files that may include special characters. I know that running dos2unix will fix the issue, but I recall there is a way to actually view the file and see/print the special characters in it.
Thanks!
You could use file, e.g.
$ file test_unix.txt
test_unix.txt: ASCII text
$ file test_dos.txt
test_dos.txt: ASCII text, with CRLF line terminators
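Besides file, grep can confirm the presence of CR bytes directly. A minimal sketch (filenames are made up; the CR byte is built with printf so it works in plain sh as well as bash):

```shell
printf 'dos\r\n' > t_dos.txt
printf 'unix\n' > t_unix.txt

# Build a literal CR byte portably, then search for it
CR=$(printf '\r')
grep -q "$CR" t_dos.txt  && echo "t_dos.txt has CR"
grep -q "$CR" t_unix.txt || echo "t_unix.txt has no CR"
```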

how to tell when vim has added EOL character

How I can see in vim when a file has a newline character at the end? It seems vim always shows one, whether it's truly there or not.
For example, opening a file that has a newline at the end:
echo "hi" > hi
# confirm that there are 3 characters, because of the extra newline added by echo
wc -c hi
# open in binary mode to prevent vim from adding its own newline
vim -b hi
:set list
This shows:
hi$
Now by comparison, a file without the newline:
# prevent echo from adding newline
echo -n "hi" > hi
# confirm that there are only 2 characters now
wc -c hi
# open in binary mode to prevent vim from adding its own newline
vim -b hi
:set list
Still shows:
hi$
So how can I see whether a file truly has a newline at the end or not in vim?
Vim stores this information in the 'endofline' buffer option. You can check with
:setlocal eol?
In the second case, Vim displays [noeol] when loading the file. Don't you see the [noeol] output when loading that file?
Vim warns you when it opens/writes a file with no <EOL> at the end of the last line:
"filename" [noeol] 4L, 38C
"filename" [noeol] 6L, 67C written

Linux replace ^M$ with $ in csv

I have received a csv file from a ftp server which I am ingesting into a table.
While ingesting the file I am receiving the error "File was a truncated file"
The actual reason is the data in a file contains $ and ^M$ in end of the line.
e.g :
ACT_RUN_TM, PROG_RUN_TM, US_HE_DT^M$
"CONFIRMED","","3600"$
How can I remove these $ and ^M$ from end of the line using linux command.
The ultimately correct solution is to transfer the file from the FTP server in text mode rather than binary mode, which does the appropriate end-of-line conversion for you. Change your download scripts or FTP application configuration to enable text transfers to fix this in future.
Assuming this is a one-shot transfer and you have already downloaded the file and just want to fix it, you can use tr(1) to translate characters. So to remove all control-M characters from a file, you can pipe through tr -d '\r'. Or if you want to replace them with control-J instead – for example you would do this if the file came from a pre-OSX Mac system — do tr '\r' '\n'.
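A minimal sketch of the tr -d fix, using a fabricated CRLF download:

```shell
# Fake a CRLF file as downloaded in binary mode
printf 'a,b\r\nc,d\r\n' > download.csv

# Delete every carriage return
tr -d '\r' < download.csv > fixed.csv

file fixed.csv    # should no longer mention CRLF line terminators
```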
It's odd to see ^M as not-the-last character, but:
sed -e 's/^M*\$$//g' <badfile >goodfile
Or use "sed -i" to update in-place.
(Note that "^M" is entered on the command line by pressing CTRL-V CTRL-M).
Update: It's been established that the question is mistaken: the "^M$" are not literally in the file but are how vi displays it. The asker actually wants to change CRLF pairs to just LF.
sed -e 's/^M$//g' <badfile >goodfile

multiple end of file $'s in a single file

I copy pasted some enum values from my IntelliJ IDE in windows to notepad, saved the file in a shared drive, then opened it up in a linux box. When I did cat -A on the file it showed something like:
A,B,C,^M$
D,E,F,^M$
G,H,I,^M$
After searching around I figured that ^M is the carriage return and $ means the last line of the file. I'm just puzzled at how this file is able to have multiple $'s.
From man cat on my GNU box:
-A, --show-all
equivalent to -vET
(snip)
-E, --show-ends
display $ at end of each line
Thus, there are multiple $s because there are multiple lines, each with an end.
$ is the end of line marker with cat -A, not end of file.
This is indicating the file has Windows-style line endings (carriage return followed by line feed) and not Unix-style (only line feed).
(You can convert text files from one format to the other using the programs dos2unix or unix2dos.)
