I am on Linux. I have received a mixed batch of files, which I forgot to verify beforehand. My editor (emacs) has saved some files that originally had CR+LF (\r\n) line endings with plain LF (\n) (!!). I realized this far too late, and I think it is causing me trouble.
I would like to find all files in my cwd that have at least one CR+LF in them. I do not trust the file command, because I think it only checks the first lines, not the whole file.
I would like to check whole files for CR+LF. Is there a tool for that, or do I need to roll my own?
You can use this grep command to list all the files in a directory that contain at least one CR+LF:
grep -l $'\r$' *
The pattern $'\r$' will find a \r just before the end of a line.
Or using hex value:
grep -l $'\x0D$' *
Here, \x0D matches \r (ASCII 13).
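For a quick sanity check, you can create two small test files (the file names here are made up) and confirm that only the one with CR+LF endings is reported:
printf 'unix line\n' > unix.txt
printf 'dos line\r\n' > dos.txt
grep -l $'\r$' *        # should list only dos.txt
grep -rl $'\r$' .       # same idea, but also descending into subdirectories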
dos2unix can not only convert DOS line endings (CR+LF) to Unix ones (LF), it can also display file information with the -i option, e.g.:
sh-4.3$ (echo "1" ; echo "") > 123.txt
sh-4.3$ unix2dos 123.txt
unix2dos: converting file 123.txt to DOS format...
sh-4.3$ cat 123.txt ; hexdump -C 123.txt ; dos2unix --info='du' 123.txt
1
00000000 31 0d 0a 0d 0a |1....|
00000005
2 0 123.txt
sh-4.3$ dos2unix 123.txt
dos2unix: converting file 123.txt to Unix format...
sh-4.3$ cat 123.txt ; hexdump -C 123.txt ; dos2unix --info='du' 123.txt
1
00000000 31 0a 0a |1..|
00000003
0 2 123.txt
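Newer dos2unix releases also accept a c flag for --info, which prints only the names of files that would actually be converted, i.e. exactly the "which files contain CR+LF" list asked for above (the availability of the c flag is the assumption here; check man dos2unix on your system):
dos2unix -ic *    # list only the files that still contain DOS (CR+LF) line breaks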
I have a file 1.txt
$ cat 1.txt
page1
рage1
But:
$ head -n1 1.txt | file -i -
/dev/stdin: text/plain; charset=us-ascii
$ head -n2 1.txt | tail -n1 | file -i -
/dev/stdin: text/plain; charset=utf-8
The strings have different charsets. Because of that, I can't get a unique string with the method I know:
$ cat 1.txt | sort | uniq -c | sort -rn
1 рage1
1 page1
So, can you help me find a way to get only one unique string in my situation?
P.S. I would prefer solutions using only the Linux command line/bash/awk.
But if you have a solution in another programming language, I'd like that too.
Update: awk '!a[$0]++' Input_file doesn't work either.
A cursory examination of what we have here:
$ cat 1.txt
page1
рage1
$ hd 1.txt
00000000 70 61 67 65 31 0a d1 80 61 67 65 31 0a |page1...age1.|
0000000d
As noted in the comments to the question, that second "рage1" is indeed distinct from the previous "page1" for a reason: that's not a Latin p, it's a Cyrillic р, so a uniqueness filter should call them out as separate unless you normalize the text beforehand.
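If you first want to see which lines contain such lookalikes at all, GNU grep's PCRE mode can flag any non-ASCII character (the pattern is my own suggestion and assumes your grep supports -P):
grep -nP '[^\x00-\x7F]' 1.txt    # prints the line containing the Cyrillic р, with its line number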
iconv won't do the trick here. uconv (e.g. apt install icu-devtools on Debian/Ubuntu) will get you close, but its transliteration mappings are based on phonetics rather than lookalike characters, so when we transliterate this example, the Cyrillic р becomes a Latin r:
$ uconv -x Cyrillic-Latin 1.txt
page1
rage1
See also these more complex uconv commands, which have similar results.
The ICU uconv man page states
uconv can also run the specified transliteration on the transcoded data, in which case transliteration will happen as an intermediate step, after the data have been transcoded to Unicode. The transliteration can be either a list of semicolon-separated transliterator names, or an arbitrarily complex set of rules in the ICU transliteration rules format.
This implies that somebody could use the "ICU transliteration rules format" to specify a lookalike character mapping. Of course, at that rate, you could use whatever language you want.
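For this particular file a single rule would do. The rule string below is my own guess at such a lookalike mapping, leaning on the man-page statement above that -x accepts rules directly; I have not verified it against every ICU version:
uconv -x '\u0440 > p ;' 1.txt | sort | uniq -c
# with the Cyrillic р (U+0440) rewritten as a Latin p, both lines collapse into a single "page1"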
I also tried perl's Text::Unidecode, but that has its own (similar) issues:
$ perl -Mutf8 -MText::Unidecode -pe '$_ = unidecode($_)' 1.txt
page1
NEURage1
That might work better in some cases, but obviously this isn't one of them.
I have a UTF-16 encoded file and I want to replace Unix line endings with Windows line endings. I don't want to touch anything else.
Is there a Linux command-line tool that can search for the two bytes "0A 00" and replace them with the four bytes "0D 00 0A 00"?
Perl to the rescue:
perl -we 'binmode STDIN, ":encoding(UTF-16le)";
binmode STDOUT, ":encoding(UTF-16le):crlf";
print while <STDIN>;
' < input.txt > output.txt
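To spot-check the conversion, dump a few bytes of both files; wherever the input shows 0a 00, the output should show 0d 00 0a 00 (input.txt and output.txt are just the names used above):
hexdump -C input.txt | head -n 2
hexdump -C output.txt | head -n 2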
You may use unix2dos, but you have to convert the file to an 8-bit encoding first, and back to UTF-16 afterwards. The obvious intermediate candidate is UTF-8:
$ cat in.txt | iconv -f UTF-16 -t UTF-8 | unix2dos | iconv -f UTF-8 -t UTF-16 > out.txt
You can wrap these three piped commands in a handy script, if you wish.
#!/bin/sh
iconv -f UTF-16 -t UTF-8 | unix2dos | iconv -f UTF-8 -t UTF-16
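Saved as, say, u16-unix2dos.sh (a made-up name) and made executable, it works as a plain stdin/stdout filter:
chmod +x u16-unix2dos.sh
./u16-unix2dos.sh < in.txt > out.txt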
unix2dos is what you're looking for. See its different options to find the one that's right for your UTF-16 encoding.
Solution:
perl -pe 'BEGIN { binmode $_, ":raw:encoding(UTF-16LE)" for *STDIN, *STDOUT } s/\n/\r\n/g' < input.file > output.file
Credit to my coworker Manu and Stream-process UTF-16 file with BOM and Unix line endings in Windows perl
I have two files and I want to see whether their first 40 bytes are the same. How can I do this using hexdump?
If you are using the BSD hexdump utility (which will also be installed as hd, with a different default output format) then you can supply the -n40 command line parameter to limit the dump to the first 40 bytes:
hexdump -n40 filename
If you are using the POSIX-standard od, you need a capital N. You might find the following invocation useful:
od -N40 -w40 -tx1 -Ax filename
(You can do that with hexdump, too, but the format string is more work to figure out :) ).
Try this:
head -c 40 myfile | hexdump
Not sure why you need hexdump here,
diff <(dd bs=1 count=40 if=file1) <(dd bs=1 count=40 if=file2)
with hexdump:
diff <(dd bs=1 count=40 if=file1|hexdump) <(dd bs=1 count=40 if=file2|hexdump)
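If all you actually need is a yes/no answer rather than a side-by-side dump, cmp can limit the comparison itself (assuming GNU cmp, whose -n/--bytes option does the limiting):
cmp -n 40 file1 file2 && echo 'first 40 bytes are identical'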
On Ubuntu 10.04.4 LTS, I did the following small test and got a surprising result:
First, I created a file with 5 lines and named it a.txt:
echo -e "1\n2\n3\n4\n5" > a.txt
$ cat a.txt
1
2
3
4
5
Then I ran wc to count the number of lines:
$ wc -l a.txt
5 a.txt
However, when I ran grep to count the number of lines that contain a line break, I got an answer that I did not understand:
$ grep -c -P '\n' a.txt
3
My question is: how does grep get this number? Shouldn't it be 4?
Please Read The Fine Manual!
seq 1 5 | wc -l
5
seq 1 5 | grep -ac $'\n'
5
I don't understand where the problem is!?
seq 1 5 | hd
00000000 31 0a 32 0a 33 0a 34 0a 35 0a |1.2.3.4.5.|
Explanation:
The -a switch tells grep to treat the input as text even if it looks binary, i.e. don't let binary-file detection suppress matching or output.
The $'\n' syntax is resolved by bash itself, before grep is run. This makes it possible to pass control characters as arguments to any command under bash (see the quick check below).
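A quick way to see exactly what the shell hands over after $'...' expansion is to pipe it through hd, the same hexdump alias used above:
printf '%s' $'\n' | hd      # the dump shows the single byte 0a
printf '%s' $'\x0D' | hd    # the dump shows the single byte 0d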
Grep cannot see the newline character; it searches for a pattern within each line.
Consider using grep -c -P '$' a.txt to match the ending of each line.
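For the five-line a.txt above that should report one matching line per line of input (my expectation, not output copied from the poster's system):
grep -c -P '$' a.txt    # expected: 5, because $ matches once at the end of every line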
The newline character is not part of the lines. grep uses the newline character as the record separator and removes it from the lines, so that patterns with $ work as expected. For example, to search for lines ending with foo you can use the pattern foo$; if the newline were kept, you would have to write foo\n$ instead, which would be very inconvenient.
So grep -c -P '\n' a.txt should give you 0. If you're getting 3, that sounds extremely strange, but perhaps it can be explained by the "highly experimental" remark in man grep:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE, see
below). This is highly experimental and grep -P may warn of
unimplemented features.
I'm on Debian Wheezy, which is much more recent than Ubuntu 10.04. If -P is "highly experimental" today, it's not too difficult to imagine it was buggy on older systems. This is just a guess, though.
To count the number of newlines, use wc -l, not a grep -c hack.
Btw, interestingly:
$ printf hello >> a.txt
$ wc -l a.txt
5 a.txt
$ grep -c '' a.txt
6
That is, printf doesn't print a newline, so after we append "hello" to a.txt, there won't be a newline at the end of the file. So wc -l counts newline characters, not exactly "lines", and grep '' (empty string) matches all lines.
I think you want to use
$ grep -c -P "." a.txt
5
$ echo "6" >> a.txt
$ grep -c -P "." a.txt
6
$ cat a.txt
1
2
3
4
5
6
Following Convert decimal to hexadecimal in UNIX shell script, I am trying to print only the hex values from hexdump, i.e. without the line offsets and the ASCII column.
But the following command line doesn't print anything:
hexdump -n 50 -Cs 10 file.bin | awk '{for(i=NF-17; i>2; --i) print $i}'
Using xxd is better for this job:
xxd -p -l 50 -seek 10 file.bin
From man xxd:
xxd - make a hexdump or do the reverse.
-p | -ps | -postscript | -plain
output in postscript continuous hexdump style. Also known as plain hexdump style.
-l len | -len len
stop after writing <len> octets.
-seek offset
When used after -r: revert with <offset> added to file positions found in hexdump.
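To get a feel for the plain (-p) style, here is a tiny example on generated input instead of the original file.bin (the printf string is made up for illustration):
$ printf 'ABCDEF' | xxd -p
414243444546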
You can specify the exact format that you want hexdump to use for output, but it's a bit tricky. Here's the default output, minus the file offsets:
hexdump -e '16/1 "%02x " "\n"' file.bin
(To me, it looks like this would produce an extra trailing space at the end of each line, but for some reason it doesn't.)
As an alternative, consider using xxd -p file.bin.
First of all, remove -C, which is what emits the ASCII column.
Then you can drop the offsets with:
hexdump -n 50 -s 10 file.bin | cut -c 9-
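If you would rather have single bytes than the default 16-bit groupings, the -e format string from the answer above combines with -n and -s as well (same hypothetical file.bin; no offsets are printed because the format only asks for the bytes):
hexdump -n 50 -s 10 -e '16/1 "%02x " "\n"' file.bin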