How to find a Windows end of line (EOL) character - Linux

I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if there are Windows EOL characters in the files. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.
So my question is: in Cygwin, how can I figure out whether these files have Windows EOL CRLF characters?
I've tried creating some test data and running
sed -r 's/\r\n//' testdata.txt
But that appears to match regardless of whether dos2unix has been run or not.
Thanks.

The file(1) utility knows the difference:
$ file * | grep ASCII
2: ASCII text
3: ASCII English text
a: ASCII C program text
blah: ASCII Java program text
foo.js: ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc: ASCII text, with very long lines
windows: ASCII text, with CRLF line terminators
file(1) has been optimized to try to read as little of a file as possible, so you may be lucky and drastically reduce the amount of disk IO you need to perform when finding and fixing the CRLF terminators.
Note that some cases of CRLF should stay in place: captures of SMTP traffic will use CRLF. But that's up to you. :)

#!/bin/bash
find . -type f | while read -r i; do
    if file "$i" | grep -q CRLF; then
        echo "$i"
        #dos2unix "$i"
    fi
done
Uncomment the dos2unix "$i" line when you are ready to convert them.

You can find out using file:
file /mnt/c/BOOT.INI
/mnt/c/BOOT.INI: ASCII text, with CRLF line terminators
CRLF is the significant value here.
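If you are scripting around this, the check reduces to an exit status (a minimal sketch reusing the path above):
if file /mnt/c/BOOT.INI | grep -q CRLF; then
    echo "DOS line endings detected"
fi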

If you expect sed's exit code to differ depending on the match, it won't: sed performs the substitution or not depending on the match, but its exit code is true unless there's an error.
You can get a usable exit code from grep, however.
#!/bin/bash
for f in *
do
    if head -n 10 "$f" | grep -qs $'\r'
    then
        dos2unix "$f"
    fi
done

Recursive grep with a file pattern filter:
grep -Pnr --include='*file.sh' '\r$' .
This outputs the file name, line number, and the line itself:
./test/file.sh:2:here is windows line break
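If you only need the file names, for example to feed a converter, swapping -n for -l should work (a sketch; the xargs step assumes GNU xargs and no newlines in file names):
grep -Plr --include='*file.sh' '\r$' . | xargs -d '\n' dos2unix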

You can use dos2unix's -i option to get information about DOS Unix Mac line breaks (in that order), BOMs, and text/binary without converting the file.
$ dos2unix -i *.txt
6 0 0 no_bom text dos.txt
0 6 0 no_bom text unix.txt
0 0 6 no_bom text mac.txt
6 6 6 no_bom text mixed.txt
50 0 0 UTF-16LE text utf16le.txt
0 50 0 no_bom text utf8unix.txt
50 0 0 UTF-8 text utf8dos.txt
With the "c" flag dos2unix will report only the files that would be converted, in other words the files that have DOS line breaks. To report all txt files with DOS line breaks you could do this:
$ dos2unix -ic *.txt
dos.txt
mixed.txt
utf16le.txt
utf8dos.txt
To convert only these files you simply do:
dos2unix -ic *.txt | xargs dos2unix
If you need to recurse over directories, you can combine it with find (GNU xargs' -d '\n' keeps file names with spaces intact):
find . -name '*.txt' | xargs -d '\n' dos2unix -ic | xargs -d '\n' dos2unix
See also the man page of dos2unix.

As stated above, the 'file' solution works. Maybe the following code snippet will help.
#!/bin/ksh
EOL_UNKNOWN="Unknown"       # Unknown EOL
EOL_MAC="Mac"               # File EOL Classic Apple Mac (CR)
EOL_UNIX="Unix"             # File EOL UNIX (LF)
EOL_WINDOWS="Windows"       # File EOL Windows (CRLF)
SVN_PROPFILE="name-of-file" # Filename to check.
...
# Finds the EOL used in the requested file
# $1 Name of the file (requested filename)
# $r EOL_FILE set to enumerated EOL values.
getEolFile() {
    EOL_FILE=$EOL_UNKNOWN
    # Check for Windows EOL (CRLF)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CRLF line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_WINDOWS
        return
    fi
    # Check for Classic Mac EOL (CR)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CR line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_MAC
        return
    fi
    # Check for UNIX EOL (LF)
    EOL_CHECK=`file "$1" | grep "ASCII text"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_UNIX
        return
    fi
    return
} # getEolFile
...
# Using this snippet
getEolFile $SVN_PROPFILE
echo "Found EOL: $EOL_FILE"
exit 1

Thanks for the tip to use the file(1) command; however, it needs a bit more refinement. I had the situation where not only plain text files but also some ".sh" scripts had the wrong EOL. And "file" reports them as follows regardless of EOL:
xxx/y/z.sh: application/x-shellscript
So the "file -e soft" option was needed (at least for Linux):
bash$ find xxx -exec file -e soft {} \; | grep CRLF
This finds all the files with DOS eol in directory xxx and subdirs.
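To go one step further and convert what was found, the file names can be cut back out of file's output (a hedged sketch; it assumes GNU tools and no colons or newlines in the paths):
find xxx -type f -exec file -e soft {} + | grep CRLF | cut -d: -f1 | xargs -d '\n' dos2unix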

What's confusing both grep and ack?

Try this: download https://www.mathworks.com/matlabcentral/fileexchange/19-delta-sigma-toolbox
In the unzipped folder, I get the following results:
ack --no-heading --no-break --matlab dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo);
fprintf(1,'Done.\n');
adc.sys_cs = sys_cs;
grep -nH -R --include="*.m" dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo); d center frequency larger Hinfation Script
fprintf(1,'Done.\n');c = c;formed.s of finite op-amp gain and capacitorased;;n for the input.
adc.sys_cs = sys_cs;snr;seed with CT simulations tora states used in the d-t model_amp); Response');
What's going on ?
[Edit for clarification]: Why is there no file name, no line number on the 3rd line result ? Why results on the 4th and 5th line do not even contain dsexample ?
NB: using ack 3.40 and grep 2.16
I do not deserve any credit for this answer - it is all about line endings.
I have known for years about Windows line endings (CR-LF) and Linux line endings (LF only), but I had never heard of legacy Mac line endings (CR only)... The latter really upsets ack, grep, and I'm sure lots of other tools.
dos2unix and unix2dos have no effect on files in legacy Mac format - but after using this nifty little endlines tool, I could eventually bring some consistency to the source files:
endlines : 129 files converted from :
- 23 Legacy Mac (CR)
- 105 Unix (LF)
- 1 Windows (CR-LF)
Now, ack and grep are much happier.
Let's see which files contain dsexample; grep -l doesn't print the contents, just the file names:
$ grep -l dsexample *
Contents.m
demoLPandBP.m
dsexample1.m
dsexample2.m
Ok, then, file shows that they have CR line terminators. (It would say "CRLF line terminators" for Windows files.)
$ file Contents.m demoLPandBP.m dsexample*
Contents.m: ASCII text
demoLPandBP.m: ASCII text, with CR line terminators
dsexample1.m: ASCII text, with CR line terminators
dsexample2.m: ASCII text, with CR line terminators
Unlike what I commented about before, Contents.m is fine. Let's look at another one, how it prints:
$ grep dsexample demoLPandBP.m
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
The output from grep is actually the whole file, since grep doesn't consider the plain CR as breaking a line -- the whole file is just one line. If we change CRs to LFs, we see it better, or can just count the lines:
$ grep dsexample demoLPandBP.m | tr '\r' '\n' | wc -l
51
These are the longest lines there, in order:
%% 5th-order lowpass with optimized zeros and larger Hinf
dsm.f0 = 1/6; % Normalized center frequency
dsexample1(dsm, LiveDemo);
With a CR in the end of each, the cursor moves back to the start of the line, partially overwriting the previous output, so you get:
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
(There's a space after the semicolon on that line, so the e gets overwritten too. I checked.)
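You can reproduce the overwriting effect directly in a terminal (a quick sketch):
$ printf 'ABCDEFGH\r123\n'
123DEFGH
The CR sends the cursor back to column one, so the 123 overwrites ABC.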
Someone said dos2unix can't deal with that, and well, they're not DOS or Windows files anyway so why should it. You could do something like this, though, in Bash:
for f in *.m; do
    if [[ $(file "$f") = *"ASCII text, with CR line terminators" ]]; then
        tr '\r' '\n' < "$f" > tmptmptmp &&
        mv tmptmptmp "$f"
    fi
done
I think it was just the .m files that had the issue, hence the *.m in the loop. There was at least one PDF file there, and we don't want to break that. Though with the check on file there, it should be safe even if you just run the loop on *.
It looks like both ack and grep are getting confused by the line endings in the files. Run file *.m on your files. You'll see that some files have proper linefeeds, and some have CR line terminators.
If you clean up your line endings, things should be OK.

Counting number of characters in a file through shell script

I want to count the number of characters in a file, from the start to the EOF character. Can anyone tell me how to do this through a shell script?
This will do it for counting bytes in a file:
wc -c filename
If you want only the count without the filename being repeated in the output:
wc -c < filename
This will count characters in multibyte files (Unicode etc.):
wc -m filename
(as shown in Sébastien's answer).
#!/bin/sh
wc -m "$1" | awk '{print $1}'
wc -m counts the number of characters; the awk command prints only the number, omitting the filename.
wc -c would give you the number of bytes (which can be different from the number of characters, since depending on the encoding a character may be encoded on several bytes).
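A quick way to see the difference (a sketch, assuming a UTF-8 locale; € is a three-byte character in UTF-8):
$ printf '€' | wc -m
1
$ printf '€' | wc -c
3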
To get the exact character count of a string, use printf rather than echo, cat, or running wc -c directly on a file: echo appends a newline, which gets counted too. So a file containing the text 'hello' will report 6 with echo, but printf returns the exact 5, because there's no newline to count.
How to use printf for counting characters within strings:
$ printf '6chars' | wc -m
6
To turn this into a script you can run on a text file to count characters, save the following in a file called print-character-amount.sh:
#!/bin/bash
characters=$(cat "$1")
printf '%s' "$characters" | wc -m
chmod +x the file print-character-amount.sh containing the above text, place it in your PATH (e.g. /usr/bin/ or any directory exported in PATH in your .bashrc file), then run the script on a text file:
print-character-amount.sh file-to-count-characters-of.txt
awk '{t+=length($0)}END{print t}' file3
awk only
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)c++}END{print "total chars:"c}' file
shell only
var=$(<file)
echo ${#var}
Ruby(1.9+)
ruby -0777 -ne 'print $_.size' file
The following script is tested and gives exactly the results that are expected:
#!/bin/bash
echo "Enter the file name"
read file
echo "enter the word to be found"
read word
count=0
for i in `cat $file`
do
    if [ $i == $word ]
    then
        count=`expr $count + 1`
    fi
done
echo "The number of words are $count"
I would have thought that it would be better to use stat to find the size of a file, since the filesystem knows it already, rather than causing the whole file to have to be read with awk or wc - especially if it is a multi-GB file or one that may be non-resident in the file-system on an HSM.
stat -c%s file
Yes, I concede it doesn't account for multi-byte characters, but would add that the OP has never clarified whether that is/was an issue.
Credits to user.py et al.
echo "ää" > /tmp/your_file.txt
cat /tmp/your_file.txt | wc -m
results in 3.
In my example the result is expected to be 2 (twice the letter ä). However, echo (or vi) adds a line break \n to the end of the output (or file). So two ä and one Linux line break \n are counted. That's three together.
Working with pipes | is not the shortest variant, but this way I have to know fewer wc parameters by heart. In addition, cat is bullet-proof in my experience.
Tested on Ubuntu 18.04.1 LTS (Bionic Beaver).

How to find dos format files in a linux file system

I would like to find out which of my files in a directory are dos text files (as opposed to unix text files).
What I've tried:
find . -name "*.php" | xargs grep ^M -l
It's not giving me reliable results... so I'm looking for a better alternative.
Any suggestions, ideas?
Thanks
Clarification
In addition to what I've said above, the problem is that I have a bunch of DOS files with no ^M characters in them (hence my note about reliability).
The way I currently determine whether a file is DOS or not is through Vim, where at the bottom it says:
"filename.php" [dos] [noeol]
How about:
find . -name "*.php" | xargs file | grep "CRLF"
I don't think it is reliable to try and use ^M to try and find the files.
Not sure what you mean exactly by "not reliable" but you may want to try:
find . -name '*.php' -print0 | xargs -0 grep -l '^M$'
This uses the more atrocious-filenames-with-spaces-in-them-friendly options and only finds carriage returns immediately before the end of line.
Keep in mind that the ^M is a single Ctrl-M (carriage return) character, not two characters.
And also that it'll list files where even one line is in DOS mode, which is probably what you want anyway since those would have been UNIX files mangled by a non-UNIX editor.
Based on your update that vim is reporting your files as DOS format:
If vim is reporting it as DOS format, then every line ends with CRLF. That's the way vim works. If even one line doesn't have CR, then it's considered UNIX format and the ^M characters are visible in the buffer. If it's all DOS format, the ^M characters are not displayed:
Vim will look for both dos and unix line endings, but Vim has a built-in preference for the unix format.
- If all lines in the file end with CRLF, the dos file format will be applied, meaning that each CRLF is removed when reading the lines into a buffer, and the buffer 'ff' option will be dos.
- If one or more lines end with LF only, the unix file format will be applied, meaning that each LF is removed (but each CR will be present in the buffer, and will display as ^M), and the buffer 'ff' option will be unix.
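To check the same all-lines-CRLF condition from the shell rather than vim, a hedged sketch (filename.php is a stand-in name, and the file is assumed to end with a newline) is to compare the CR-terminated line count with the total line count:
f=filename.php
total=$(wc -l < "$f")
crlf=$(grep -c $'\r$' "$f")   # lines ending in CR (bash $'...' quoting)
[ "$total" -eq "$crlf" ] && echo "pure DOS format"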
If you really want to know what's in the file, don't rely on a too-smart tool like vim :-)
Use:
od -xcb input_file_name | less
and check the line endings yourself.
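For instance, a two-line DOS file shows the \r \n pairs plainly (a small sketch using od -c, which is a bit easier to read than od -xcb):
$ printf 'a\r\nb\r\n' | od -c
0000000   a  \r  \n   b  \r  \n
0000006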
I had good luck with:
find . -name "*.php" -exec grep -Pl "\r" {} \;
This is much like your original solution; therefore, it's possibly more easy for you to remember:
find . -name "*.php" | xargs grep "\r" -l
Thought process:
In Vim, to remove the ^M you type:
:%s/^M//g
where ^M is entered by pressing Ctrl-V and then Enter (Ctrl-V inserts the next keystroke, here a carriage return, literally). But I could never remember the keys to type to produce that sequence, so I've always removed them using:
:%s/\r//g
So my deduction is that \r and ^M are equivalent, with the former being easier to remember to type.
If your dos2unix command has the -i option, you can use that feature to find files in a directory that have DOS line breaks.
$ man dos2unix
.
.
.
-i[FLAGS], --info[=FLAGS] FILE ...
Display file information. No conversion is done.
The following information is printed, in this order:
number of DOS line breaks,
number of Unix line breaks,
number of Mac line breaks,
byte order mark,
text or binary, file name.
.
.
.
Optionally extra flags can be set to change the (-i) output.
.
.
.
c Print only the files that would be converted.
The following one-liner script reads:
find all files in this directory tree,
run dos2unix on all files to determine the files to be changed,
run dos2unix on files to be changed
$ find . -type f | xargs -d '\n' dos2unix -ic | xargs -d '\n' dos2unix
I've been using cat -e to see what line endings files have.
Using ^M as a single Ctrl-M character didn't really work out for me (it acted as if I had just pressed Return, without actually inserting the non-printable ^M line ending; tested with echo 'CTRL-M' | cat -e), so what I ended up doing will probably seem like too much, but it did the job nevertheless:
grep '$' *.php | cat -e | grep '\^M\$' | sed 's/:.*//' | uniq
, where
the first grep just prepends filenames to each line of each file (can be replaced with awk '{print FILENAME, $0}', but grep worked faster on my set of files);
cat -e explicitly prints non-printable line endings;
the second grep finds lines ending with ^M$, and ^M are two characters;
the sed part keeps only the file names (can be replaced with cut -d ':' -f 1);
uniq just keeps each file name once.
GNU find
find . -type f -iname "*.php" -exec file "{}" + | grep CRLF
I don't know what you want to do after you find those DOS php files, but if you want to convert them to unix format, then
find . -type f -iname "*.php" -exec dos2unix "{}" +
will suffice. There's no need to specifically check whether they are DOS files or not.
If you prefer vim to tell you which files are in this format you can use the following script:
"use this script to check which files are in dos format according to vim
"use: in the folder that you want to check
"create a file, say res.txt
"> vim -u NONE --noplugins res.txt
"> in vim: source this_script.vim
python << EOF
import os
import vim

cur_buf = vim.current.buffer
IGNORE_START = ''.split()
IGNORE_END = '.pyc .swp .png ~'.split()
IGNORE_DIRS = '.hg .git dd_ .bzr'.split()
for dirpath, dirnames, fnames in os.walk(os.curdir):
    for dirn in dirnames:
        for diri in IGNORE_DIRS:
            if dirn.endswith(diri):
                dirnames.remove(dirn)
                break
    for fname in fnames:
        skip = False
        for fstart in IGNORE_START:
            if fname.startswith(fstart):
                skip = True
        for fend in IGNORE_END:
            if fname.endswith(fend):
                skip = True
        if skip is True:
            continue
        fname = os.path.join(dirpath, fname)
        vim.command('view {}'.format(fname))
        curr_ff = vim.eval('&ff')
        if vim.current.buffer != cur_buf:
            vim.command('bw!')
        if curr_ff == 'dos':
            cur_buf.append('{} {}'.format(curr_ff, fname))
EOF
Your vim needs to be compiled with Python support (Python is used to loop over the files in the folder; there is probably an easier way of doing this, but I don't really know it).

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
Update: I totally forgot that grep can print the context around a match... this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed it is more efficient to quit processing after having printed the last line than to continue processing until the end of the file. This is especially important for large files and when printing lines near the beginning. In order to do so, the sed command above introduces the instruction 41q to stop processing after line 41, because in the example we are interested in lines 20-40 only. You will need to change the 41 to whatever the last line you are interested in is, plus one.
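Applied to the range from the question, that would be (a sketch):
sed -n '347340100,347340200p;347340201q' filename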
# print line number 52
sed -n '52p'    # method 1
sed '52!d'      # method 2
sed '52q;d'     # method 3, efficient on large files
Method 3 is the fastest way to display specific lines in large files.
with GNU-grep you could just say
grep --context=10 ...
No there isn't, files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
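It does; the mechanics are easy to verify on a small file (a quick sketch):
$ seq 1000 > test.txt
$ tail -n +995 test.txt | head -n 3
995
996
997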
I prefer just going into less and
typing 50% to go halfway through the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!), at that location. Now, note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by the command - execute the (vi/vim) command after the file has been read,
-s - silent mode, also uses current terminal as a default output,
q followed by -c is the command to quit editor (add ! to do force quit, e.g. -scq!).
I'd first split the file into few smaller ones like this
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
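With 50000-line chunks, integer division tells you which output file (counting from zero, in split's suffix order) holds a given line; a sketch using the line number from the question:
$ echo $(( 347340107 / 50000 ))
6946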
If the line number you want is 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too to count the lines.
The only way a shortcut would be possible would be if there were context/order in the file to operate on. For example, if log lines were prepended with a fixed-width time/date etc., you could use the look Unix utility to binary search through the files for particular dates/times.
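For example, with a log whose lines start with a sortable timestamp (the file name and format here are hypothetical), look does a binary search instead of a full scan:
look '2019-03-07 14:' sorted.log   # requires the file to be sorted on the leading field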
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well
sed -n "${x},$((x+100))p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C as it doesn't try to match lines to a pattern.
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
startidx=$1
len=$2
endidx=$(($startidx+$len))
filename=$3
awk "FNR>=${startidx} && FNR<=${endidx} { print NR\" \"\$0 }; FNR>${endidx} { print \"END HERE\"; exit }" $filename
}
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
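For instance, one such tweak is prefixing each printed line with its line number via perl's $. variable (a sketch):
perl -wne 'print "$.: $_" if m/<regex1>/ .. m/<regex2>/' <filename>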
Last thing: perl, being a mature language, has many hidden enhancements to favor speed and performance, which makes it the obvious choice for such an operation since it was originally developed for handling large log files, text, databases, etc.
print line 5:
sed -n '5p' file.txt
sed '5q' file.txt   # prints lines 1-5, quitting after line 5
print everything except line 5:
sed '5d' file.txt
And my creation, using Google:
#!/bin/bash
# removeline.sh
# removes a line from a file; with -o, the removed line is moved to another file
usage() { # Function: Print a help message.
    echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
    echo "line is removed from INPUTFILE"
    echo "line is appended to OUTPUTFILE"
}
exit_abnormal() { # Function: Exit with error.
    usage
    exit 1
}
while getopts l:i:o:b flag
do
    case "${flag}" in
        l) line=${OPTARG};;
        i) input=${OPTARG};;
        o) output=${OPTARG};;
    esac
done
if [ -f tmp ]; then
    echo "Temp file 'tmp' exists. Delete it yourself :)"
    exit
fi
if [ -f "$input" ]; then
    re_isanum='^[0-9]+$'
    if ! [[ $line =~ $re_isanum ]] ; then
        echo "Error: LINENUMBER must be a positive, whole number."
        exit 1
    elif [ $line -eq "0" ]; then
        echo "Error: LINENUMBER must be greater than zero."
        exit_abnormal
    fi
    if [ ! -z $output ]; then
        sed -n "${line}p" $input >> $output
    fi
    if [ ! -z $input ]; then
        # remove this sed command and the line is copied instead of moved
        sed "${line}d" $input > tmp && cp tmp $input
    fi
fi
if [ -f tmp ]; then
    rm tmp
fi
You could try this command:
egrep -n ".*" <filename> | egrep "^<line number>:"
Easy with perl! If you want to get line 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
I am surprised only one other answer (by Ramana Reddy) suggested adding line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n ${file} | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }

How do you search for files containing DOS line endings (CRLF) with grep on Linux?

I want to search for files containing DOS line endings with grep on Linux. Something like this:
grep -IUr --color '\r\n' .
The above seems to match for literal rn which is not what is desired.
The output of this will be piped through xargs into fromdos to convert CRLF to LF, like this:
grep -IUrl --color '^M' . | xargs -ifile fromdos 'file'
grep probably isn't the tool you want for this. It will print a line for every matching line in every file. Unless you want to, say, run todos 10 times on a 10 line file, grep isn't the best way to go about it. Using find to run file on every file in the tree then grepping through that for "CRLF" will get you one line of output for each file which has dos style line endings:
find . -not -type d -exec file "{}" ";" | grep CRLF
will get you something like:
./1/dos1.txt: ASCII text, with CRLF line terminators
./2/dos2.txt: ASCII text, with CRLF line terminators
./dos.txt: ASCII text, with CRLF line terminators
Use Ctrl+V, Ctrl+M to enter a literal Carriage Return character into your grep string. So:
grep -IUr --color "^M"
will work - if the ^M there is a literal CR that you input as I suggested.
If you want the list of files, you want to add the -l option as well.
Explanation
-I ignore binary files
-U prevents grep from stripping CR characters. By default it does this if it decides it's a text file.
-r read all files under each directory recursively.
Using RipGrep (depending on your shell, you might need to quote the last argument):
rg -l \r
-l, --files-with-matches
Only print the paths with at least one match.
https://github.com/BurntSushi/ripgrep
If your version of grep supports -P (--perl-regexp) option, then
grep -lUP '\r$'
could be used.
# list files containing dos line endings (CRLF)
cr="$(printf "\r")" # alternative to ctrl-V ctrl-M
grep -Ilsr "${cr}$" .
grep -Ilsr $'\r$' . # yet another & even shorter alternative
dos2unix has a file information option which can be used to show the files that would be converted:
dos2unix -ic /path/to/file
To do that recursively you can use bash’s globstar option, which for the current shell is enabled with shopt -s globstar:
dos2unix -ic ** # all files recursively
dos2unix -ic **/file # files called “file” recursively
Alternatively you can use find for that:
find -type f -exec dos2unix -ic {} + # all files recursively (ignoring directories)
find -name file -exec dos2unix -ic {} + # files called “file” recursively
You can use the file command on Unix. It gives you the character encoding of the file along with the line terminators.
$ file myfile
myfile: ISO-8859 text, with CRLF line terminators
$ file myfile | grep -ow CRLF
CRLF
The query was search... I have a similar issue... somebody submitted mixed line endings into the version control, so now we have a bunch of files with 0x0d 0x0d 0x0a line endings. Note that
grep -P '\x0d\x0a'
finds all lines, whereas
grep -P '\x0d\x0d\x0a'
and
grep -P '\x0d\x0d'
finds no lines, so there may be something "else" going on inside grep when it comes to line ending patterns... unfortunately for me!
If, like me, your minimalist unix doesn't include niceties like the file command, and backslashes in your grep expressions just don't cooperate, try this:
$ for file in `find . -type f` ; do
> dump $file | cut -c9-50 | egrep -m1 -q ' 0d| 0d'
> if [ $? -eq 0 ] ; then echo $file ; fi
> done
Modifications you may want to make to the above include:
tweak the find command to locate only the files you want to scan
change the dump command to od or whatever file dump utility you have
confirm that the cut command includes both a leading and trailing space as well as just the hexadecimal character output from the dump utility
limit the dump output to the first 1000 characters or so for efficiency
For example, something like this may work for you using od instead of dump:
od -t x2 -N 1000 $file | cut -c8- | egrep -m1 -q ' 0d| 0d|0d$'
