What's confusing both grep and ack? - linux

Try this: download https://www.mathworks.com/matlabcentral/fileexchange/19-delta-sigma-toolbox
In the unzipped folder, I get the following results:
ack --no-heading --no-break --matlab dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo);
fprintf(1,'Done.\n');
adc.sys_cs = sys_cs;
grep -nH -R --include="*.m" dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo); d center frequency larger Hinfation Script
fprintf(1,'Done.\n');c = c;formed.s of finite op-amp gain and capacitorased;;n for the input.
adc.sys_cs = sys_cs;snr;seed with CT simulations tora states used in the d-t model_amp); Response');
What's going on?
[Edit for clarification]: Why is there no file name and no line number on the 3rd result line? And why do the results on the 4th and 5th lines not even contain dsexample?
NB: using ack 3.40 and grep 2.16

I do not deserve any credit for this answer - it is all about line endings.
I have known for years about Windows line endings (CR-LF) and Linux line endings (LF only), but I had never heard of Legacy Mac line endings (CR only)... The latter really upset ack, grep, and I'm sure lots of other tools.
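For reference, the three conventions are easy to tell apart at the byte level. A quick check with od -c (the sample strings here are purely illustrative):
$ printf 'one\r\ntwo\r\n' | od -c   # Windows: each line ends in \r \n
$ printf 'one\rtwo\r' | od -c       # Legacy Mac: \r only, no \n anywhere
$ printf 'one\ntwo\n' | od -c       # Unix: \n only
od -c prints CR as \r and LF as \n, so running it directly on one of the toolbox .m files shows immediately which convention that file uses.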
dos2unix and unix2dos have no effect on files in Legacy Mac format - but after using this nifty little endlines tool, I could eventually bring some consistency to the source files:
endlines : 129 files converted from :
- 23 Legacy Mac (CR)
- 105 Unix (LF)
- 1 Windows (CR-LF)
Now, ack and grep are much happier.

Let's see which files contain dsexample; grep -l prints just the file names, not the matching lines:
$ grep -l dsexample *
Contents.m
demoLPandBP.m
dsexample1.m
dsexample2.m
OK then, file shows that they have CR line terminators. (It would say "CRLF line terminators" for Windows files.)
$ file Contents.m demoLPandBP.m dsexample*
Contents.m: ASCII text
demoLPandBP.m: ASCII text, with CR line terminators
dsexample1.m: ASCII text, with CR line terminators
dsexample2.m: ASCII text, with CR line terminators
Unlike what I commented before, Contents.m is fine. Let's look at one of the others and see how it prints:
$ grep dsexample demoLPandBP.m
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
The output from grep is actually the whole file, since grep doesn't consider a bare CR as breaking a line -- the whole file is just one line. If we change the CRs to LFs, we see it better, or we can just count the lines:
$ grep dsexample demoLPandBP.m | tr '\r' '\n' | wc -l
51
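Counting without the tr step makes the same point from the other direction: wc -l counts LF characters, so a CR-only file reports zero lines (output assumes the file is still in its original CR-only form):
$ wc -l demoLPandBP.m
0 demoLPandBP.m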
These are the longest lines there, in order:
%% 5th-order lowpass with optimized zeros and larger Hinf
dsm.f0 = 1/6; % Normalized center frequency
dsexample1(dsm, LiveDemo);
With a CR at the end of each, the cursor moves back to the start of the line, partially overwriting the previous output, so you get:
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
(There's a space after the semicolon on that line, so the e gets overwritten too. I checked.)
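You can also make the mechanism visible by letting cat print the control characters instead of letting the terminal act on them; cat -v renders each CR as ^M:
$ grep dsexample demoLPandBP.m | cat -v
With the ^M markers printed literally, the "merged" line above unfolds into the individual source lines joined by ^M, and nothing gets overwritten.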
Someone said dos2unix can't deal with that, and well, they're not DOS or Windows files anyway so why should it. You could do something like this, though, in Bash:
for f in *.m; do
    # only touch files that file(1) reports as CR-only text
    if [[ $(file "$f") = *"ASCII text, with CR line terminators" ]]; then
        tr '\r' '\n' < "$f" > tmptmptmp &&
        mv tmptmptmp "$f"
    fi
done
I think it was just the .m files that had the issue, hence the *.m in the loop. There was at least one PDF file there, and we don't want to break that. Though with the check on file there, it should be safe even if you just run the loop on *.
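After the loop, a quick sanity check (assuming file keeps reporting the same strings as above) should show no CR-only files left:
$ file *.m | grep -c 'CR line terminators'
0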

It looks like both ack and grep are getting confused by the line endings in the files. Run file *.m on your files. You'll see that some files have proper linefeeds, and some have CR line terminators.
If you clean up your line endings, things should be OK.

Related

Linux split text file by number of lines keep linebreaks in place

I am new to linux (not my own server) and I want to split some windows txt files by calling a bash script from a third party application:
So far I have it working in two ways up to a point:
split -l 5000 LargeFile.txt SmallFile
for file in SmallFile*
do
    mv "$file" "$file.txt"
done
awk '{filename = "wrd." int((NR-1)/5000) ".txt"; print >> filename}' LargeFile.txt
But both give me txt files with the result:
line1line2line3line4
I found some topics about writing LargeFile.txt like this: $ (LargeFile.txt), but it is not working for me. (I also found a switch to make the split command produce txt files directly, but this is not working either.)
I hope someone can help me out on this one.
Explanation: Line terminators
As explained by various answers to this question, the standard line terminators differ between operating systems:
Linux uses LF (line feed, 0x0a)
Windows uses CRLF (carriage return and line feed, 0x0d 0x0a)
Mac, pre-OS X, used CR (carriage return, 0x0d)
To solve your problem, it would be important to figure out what line terminators your LargeFile.txt uses. The simplest way would be the file command:
file LargeFile.txt
The output will indicate if line terminators are CR or CRLF and otherwise just state that it is an ASCII file.
Since LF and CRLF line terminators are recognized properly on Linux, and lines should not appear merged together (no matter how you view the file) unless you configure an editor specifically so that they do, I will assume that your file has CR line terminators.
Example solution to your problem (assuming CR line terminators)
If you want to split the file with shell commands, you will potentially face the problem that the likes of cat, split, awk, etc. do not recognize these line endings in the first place. If your file is very large, this may additionally lead to memory issues.
Therefore, the best way to handle this may be to translate the line terminators first (using the tr command) so that they are understood in Linux (i.e. to LF) and then apply your split or awk code before translating the line terminators back (if you believe you need to do this).
tr "\r" "\n" < LargeFile.txt > temporary_file.txt
split -l 5000 temporary_file.txt SmallFile
rm temporary_file.txt
for file in SmallFile*; do tr "\n" "\r" < "$file" > "$file.txt"; rm "$file"; done
Note that the last line is actually a for loop:
for file in SmallFile*
do
    tr "\n" "\r" < "$file" > "$file.txt"
    rm "$file"
done
This loop will again use tr to restore the CR line terminators and additionally give the resulting files a txt filename ending.
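As a quick sanity check (assuming split's default aa, ab, ... suffixes), file should now report the restored CR terminators on the renamed pieces:
$ file SmallFileaa.txt
SmallFileaa.txt: ASCII text, with CR line terminators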
Some Remarks
Of course, if you would like to keep the LF line terminators, you should not execute that last loop.
And finally, if you find that your file has a different type of line terminator, you may need to adapt the tr command in the first line.
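For example, if file had reported CRLF line terminators instead, the first line would simply delete the CRs rather than translate them (a sketch, keeping the same file names):
tr -d "\r" < LargeFile.txt > temporary_file.txt
and, should the CRLF endings need to be restored at the end, running unix2dos on the resulting SmallFile*.txt pieces (if the dos2unix package is installed) does the reverse.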
Both tr and split (and also cat and rm) are part of GNU coreutils and should be installed on your system unless you are in a very untypical environment (a rescue shell in an initial RAM disk, perhaps). The same goes for the file command.

head file with windows line endings

On my CentOS server, I have several huge files with Windows line endings, from which I need the first n lines in order to run some tests.
I've tried a few standard "linux" ways of doing it:
head -10 file.dat
And
sed -n 1,10p file.dat
And
awk 'NR <=10' file.dat
None of these respect the Windows line endings; they simply output the entire file.
Is there a way to get the first n lines of a file with Windows line endings?
Also, it should be noted that the output should still have the Windows line endings.
This wouldn't happen with Windows line endings, which are CRLF, since Unix uses LF. So the LF would still be seen and used.
What you're describing would happen if the line endings were just CR without LF. You can translate this with:
tr '\r' '\n' < file.dat | head -10 | tr '\n' '\r'
The first tr converts to Unix format, and the second one translates back to the original format.
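To illustrate the first point: if file.dat really had Windows (CRLF) endings, plain head would already do the job and the endings would survive untouched. A quick check (first10.dat is just an illustrative name):
$ head -10 file.dat > first10.dat
$ file first10.dat
first10.dat: ASCII text, with CRLF line terminators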
You could always use vim:
vim foo.txt +"%s/\r/\r/g" +wq
This replaces every carriage return with a line break: in Vim's substitute command, \r in the pattern matches a CR, while \r in the replacement inserts a proper line break.

text file contains lines of bizarre characters - want to fix

I'm an inexperienced programmer grappling with a new problem in a large text file which contains data I am trying to process. Here's a screen capture of what I'm looking at (using 'less' - I am on a linux server):
https://drive.google.com/file/d/0B4VAqfRxlxGpaW53THBNeGh5N2c/view?usp=sharing
Bioinformaticians will recognize this file as a "fastq" file containing DNA sequence data. The top half of the screenshot contains data in its expected format (which I admit contains some "bizarre" characters, but that is not the issue). However, the bottom half (with many characters shaded in white) is completely messed up. If I were to scroll down the file, it eventually returns to normal text after about 500 lines. I want to fix it because it is breaking downstream operations I am trying to perform (which complain about precisely this position in the file).
Is there a way to grep for and remove the shaded lines? Or can I fix this problem by somehow changing the encoding on the offending lines?
Thanks
If you are lucky, you can use
strings file > file2
If that doesn't help, try it another way.
Determine the line length of the correct lines (I think the first two lines are different).
head -1 file | wc -c
head -2 file | tail -1 | wc -c
Hmm, wc also counts the line ending, so subtract 1 from both lengths.
Then try to read the file one line at a time. Use a case statement so you do not have to write a lot of else-if constructions comparing the length with the expected lengths. In the code below I accept the lengths 20, 100 and 330.
Redirect everything to another file outside the loop (redirecting inside the loop would overwrite the file on each line).
while IFS= read -r line; do
    # keep only lines whose length matches one of the expected values
    case ${#line} in
        20|100|330) echo "$line" ;;
    esac
done < file > file2
A totally different approach would be filtering out the wrong lines with sed, awk or grep, but that would require knowing which characters you will and won't accept.
If you are lucky, all the ugly lines will have a character in common, like '<' or maybe a '#'. In that case you can use egrep:
egrep -v "<|#" file > file2
BASED ON INSPECTION OF THE SCREENSHOT
sed -r 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
to make the actual changes in the file and make a backup file with a .bak extension do
sed -r -i.bak 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file

How to find a windows end of line (EOL) character

I have several hundred GB of data that I need to paste together using the unix paste utility in Cygwin, but it won't work properly if there are windows EOL characters in the files. The data may or may not have windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.
So my question is, in Cygwin, how can I figure out whether these files have windows EOL CRLF characters?
I've tried creating some test data and running
sed -r 's/\r\n//' testdata.txt
But that appears to match regardless of whether dos2unix has been run or not.
Thanks.
The file(1) utility knows the difference:
$ file * | grep ASCII
2: ASCII text
3: ASCII English text
a: ASCII C program text
blah: ASCII Java program text
foo.js: ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc: ASCII text, with very long lines
windows: ASCII text, with CRLF line terminators
file(1) has been optimized to try to read as little of a file as possible, so you may be lucky and drastically reduce the amount of disk IO you need to perform when finding and fixing the CRLF terminators.
Note that some cases of CRLF should stay in place: captures of SMTP traffic, for example, will use CRLF. But that's up to you. :)
#!/bin/bash
# list (and optionally convert) files that file(1) reports as CRLF-terminated
find . -type f | while IFS= read -r i; do
    if file "$i" | grep -q CRLF; then
        echo "$i"
        file "$i"
        #dos2unix "$i"
    fi
done
Uncomment "#dos2unix "$i"" when you are ready to convert them.
You can find out using file:
file /mnt/c/BOOT.INI
/mnt/c/BOOT.INI: ASCII text, with CRLF line terminators
CRLF is the significant value here.
If you expect sed's exit code to tell you whether it matched, it won't: sed performs the substitution or not depending on the match, but its exit code is true (0) unless there is an error.
You can get a usable exit code from grep, however.
#!/bin/bash
for f in *
do
    if head -n 10 "$f" | grep -qs $'\r'
    then
        dos2unix "$f"
    fi
done
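If the check has to cover a whole directory tree rather than a single directory, the same idea works with find feeding the loop (a sketch; it still only sniffs the first 10 lines of each file):
#!/bin/bash
# recursively convert files whose first lines contain a CR byte
find . -type f -print0 | while IFS= read -r -d '' f
do
    if head -n 10 "$f" | grep -qs $'\r'
    then
        dos2unix "$f"
    fi
done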
To grep recursively, with a file pattern filter:
grep -Pnr --include='*file.sh' '\r$' .
This outputs the file name, line number and the line itself:
./test/file.sh:2:here is windows line break
You can use dos2unix's -i option to get information about DOS, Unix and Mac line breaks (in that order), BOMs, and whether the file is text or binary, without converting the file.
$ dos2unix -i *.txt
6 0 0 no_bom text dos.txt
0 6 0 no_bom text unix.txt
0 0 6 no_bom text mac.txt
6 6 6 no_bom text mixed.txt
50 0 0 UTF-16LE text utf16le.txt
0 50 0 no_bom text utf8unix.txt
50 0 0 UTF-8 text utf8dos.txt
With the "c" flag dos2unix will report files that would be converted, iow files have have DOS line breaks. To report all txt files with DOS line breaks you could do this:
$ dos2unix -ic *.txt
dos.txt
mixed.txt
utf16le.txt
utf8dos.txt
To convert only these files you simply do:
dos2unix -ic *.txt | xargs dos2unix
If you need to recurse over directories you do:
find -name '*.txt' | xargs dos2unix -ic | xargs dos2unix
See also the man page of dos2unix.
As stated above, the 'file' solution works. Perhaps the following code snippet will help.
#!/bin/ksh
EOL_UNKNOWN="Unknown" # Unknown EOL
EOL_MAC="Mac" # File EOL Classic Apple Mac (CR)
EOL_UNIX="Unix" # File EOL UNIX (LF)
EOL_WINDOWS="Windows" # File EOL Windows (CRLF)
SVN_PROPFILE="name-of-file" # Filename to check.
...
# Finds the EOL convention used in the requested file
# $1      Name of the file to check
# Result: EOL_FILE set to one of the enumerated EOL values
getEolFile() {
    EOL_FILE=$EOL_UNKNOWN
    # Check for Windows EOL (CRLF)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CRLF line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_WINDOWS
        return
    fi
    # Check for Classic Mac EOL (CR)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CR line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_MAC
        return
    fi
    # Check for Unix EOL (LF)
    EOL_CHECK=`file "$1" | grep "ASCII text"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_UNIX
        return
    fi
    return
} # getEolFile
...
# Using this snippet
getEolFile $SVN_PROPFILE
echo "Found EOL: $EOL_FILE"
exit -1
Thanks for the tip to use the file(1) command; however, it does need a bit more refinement. I had a situation where not only plain text files but also some ".sh" scripts had the wrong EOL, and "file" reported them as follows regardless of the EOL:
xxx/y/z.sh: application/x-shellscript
So the "file -e soft" option was needed (at least for Linux):
bash$ find xxx -exec file -e soft {} \; | grep CRLF
This finds all the files with DOS EOLs in directory xxx and its subdirectories.

How do you search for files containing DOS line endings (CRLF) with grep on Linux?

I want to search for files containing DOS line endings with grep on Linux. Something like this:
grep -IUr --color '\r\n' .
The above seems to match for literal rn which is not what is desired.
The output of this will be piped through xargs into fromdos to convert CRLF to LF, like this:
grep -IUrl --color '^M' . | xargs -ifile fromdos 'file'
grep probably isn't the tool you want for this. It will print a line for every matching line in every file. Unless you want to, say, run fromdos 10 times on a 10-line file, grep isn't the best way to go about it. Using find to run file on every file in the tree, then grepping through that for "CRLF", will get you one line of output for each file that has DOS-style line endings:
find . -not -type d -exec file "{}" ";" | grep CRLF
will get you something like:
./1/dos1.txt: ASCII text, with CRLF line terminators
./2/dos2.txt: ASCII text, with CRLF line terminators
./dos.txt: ASCII text, with CRLF line terminators
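To feed that back into the conversion step from the question, you can strip the output down to just the file names first (a sketch; it assumes GNU xargs and paths that contain no colons or newlines):
find . -not -type d -exec file "{}" ";" | grep CRLF | cut -d: -f1 | xargs -d '\n' fromdos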
Use Ctrl+V, Ctrl+M to enter a literal Carriage Return character into your grep string. So:
grep -IUr --color "^M"
will work - if the ^M there is a literal CR that you input as I suggested.
If you want the list of files, you want to add the -l option as well.
Explanation
-I ignore binary files
-U prevents grep from stripping CR characters. By default it does this if it decides the file is a text file.
-r read all files under each directory recursively.
Using RipGrep (depending on your shell, you might need to quote the last argument):
rg -l \r
-l, --files-with-matches
Only print the paths with at least one match.
https://github.com/BurntSushi/ripgrep
If your version of grep supports -P (--perl-regexp) option, then
grep -lUP '\r$'
could be used.
# list files containing dos line endings (CRLF)
cr="$(printf "\r")" # alternative to ctrl-V ctrl-M
grep -Ilsr "${cr}$" .
grep -Ilsr $'\r$' . # yet another & even shorter alternative
dos2unix has a file information option which can be used to show the files that would be converted:
dos2unix -ic /path/to/file
To do that recursively you can use bash’s globstar option, which for the current shell is enabled with shopt -s globstar:
dos2unix -ic ** # all files recursively
dos2unix -ic **/file # files called “file” recursively
Alternatively you can use find for that:
find -type f -exec dos2unix -ic {} + # all files recursively (ignoring directories)
find -name file -exec dos2unix -ic {} + # files called “file” recursively
You can use the file command on Unix. It gives you the character encoding of the file along with the line terminators.
$ file myfile
myfile: ISO-8859 text, with CRLF line terminators
$ file myfile | grep -ow CRLF
CRLF
The question was about searching... I have a similar issue... somebody committed mixed line endings into version control, so now we have a bunch of files with 0x0d 0x0d 0x0a line endings. Note that
grep -P '\x0d\x0a'
finds all lines, whereas
grep -P '\x0d\x0d\x0a'
and
grep -P '\x0d\x0d'
find no lines, so there may be something "else" going on inside grep when it comes to line-ending patterns... unfortunately for me!
If, like me, your minimalist unix doesn't include niceties like the file command, and backslashes in your grep expressions just don't cooperate, try this:
$ for file in `find . -type f` ; do
> dump $file | cut -c9-50 | egrep -m1 -q ' 0d| 0d'
> if [ $? -eq 0 ] ; then echo $file ; fi
> done
Modifications you may want to make to the above include:
tweak the find command to locate only the files you want to scan
change the dump command to od or whatever file dump utility you have
confirm that the cut command includes both a leading and trailing space as well as just the hexadecimal character output from the dump utility
limit the dump output to the first 1000 characters or so for efficiency
For example, something like this may work for you using od instead of dump:
od -t x2 -N 1000 $file | cut -c8- | egrep -m1 -q ' 0d| 0d|0d$'
