LF --> CR/LF conversion for UTF-16 file - linux

I have an UTF-16 encoded file and I want replace UNIX line endings with Windows line endings. I don't want to touch anything else.
Is there a linux command line tool that can search for two bytes "0A 00" and replace it with four bytes "0D 00 0A 00"?

Perl to the rescue:
perl -we 'binmode STDIN, ":encoding(UTF-16le)";
binmode STDOUT, ":encoding(UTF-16le):crlf";
print while <STDIN>;
' < input.txt > output.txt

You may use unix2dos, but you have to convert the file to a 8-bit encoding before, and back to UTF-16 after. The obvious intermediate candidate is UTF-8:
$ cat in.txt | iconv -f UTF-16 -t UTF-8 | unix2dos | iconv -f UTF-8 -t UTF-16 > out.txt
You can wrap these three piped commands in a handy script, if you wish.
#/bin/sh
iconv -f UTF-16 -t UTF-8 | unix2dos | iconv -f UTF-8 -t UTF-16

unix2dos is what you're looking for. See its different options to find the one that's right for your UTF-16 encoding.

Solution:
perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/\n\0/\r\0\n\0/g;" < input.file > output.file
Credit to my coworker Manu and Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

Related

Hexdump reverse command

The hexdump command converts any file to hex values.
But what if I have hex values and I want to reverse the process, is this possible?
There is a similar tool called xxd. If you run xxd with just a file name it dumps the data in a fairly standard hex dump format:
# xxd bdata
0000000: 0001 0203 0405
......
Now if you pipe the output back to xxd with the -r option and redirect that to a new file, you can convert the hex dump back to binary:
# xxd bdata | xxd -r >bdata2
# cmp bdata bdata2
# xxd bdata2
0000000: 0001 0203 0405
I've written a short AWK script which reverses hexdump -C output back to the
original data. Use like this:
reverse-hexdump.sh hex.txt > data
Handles '*' repeat markers and generating original data even if binary.
hexdump -C and reverse-hexdump.sh make a data round-trip pair. It is
available here:
GitHub reverse-hexdump repo
Direct to reverse-hexdump.sh
Restore file, given only the output of hexdump file
If you only have the output of hexdump file and want to restore the original file, first note that hexdump's default output depends on the endianness of the system you ran hexdump on!
If you have access to the system that created the dump, you can determinate its endianness using below command:
[[ "$(printf '\01\03' | hexdump)" == *0103* ]] && echo big || echo little
Reversing little-endian hexdump
This is the most common case. All x86/x64 systems are little-endian. If you don't know the endianness of the system that ran hexdump file, try this.
sed 's/ \(..\)\(..\)/ \2\1/g;$d' dump | xxd -r
The sed part converts hexdump's format into xxd's format, at least so far that xxd -r works.
Reversing big-endian hexdump
sed '$d' dump | xxd -r
Known Bugs (see comment section)
A trailing null byte is added if the original file was of odd length (e.g. 1, 3, 5, 7, ..., byte long).
Repeating sections of the original file are not restored correctly if they were hexdumped using a *.
You can check your dump for above problematic cases by running below command:
grep -qE '^\*|^[0-9a-f]*[13579bdf] *$' dump && echo bug || echo ok
Better alternative to create hexdumps in the first place
Besides the non-posix (and therefore not so portable) xxd there is od (octal dump) which should be available on all unix-like systems as it is specified by posix:
od -tx1 -An -v
Will print a hexadecimal dump, grouping digits as single bytes (-tx1), with no Address prefixes (-An, similar to xxd -p) and without abbreviating repeated sections as * (-v). You can reverse such a dump using xxd -r -p.
As someone who sucks at bash, I could not understand the examples already posted.
Here is what would have helped me when I was originally searching:
Take your text file "AYE.TXT" and convert it into a hex dump called "BEE.TXT"
xxd -p "AYE.TXT" > "BEE.TXT"
Take your hex dump file ("BEE.TXT") and covert it back to ascii file "CEE.TXT"
xxd -r -p "BEE.TXT" > "CEE.TXT"
Now that you have some simple working code, feel free to check out
"xxd -help" on the command line for an explanation of what all those flags do.
(That part is the easy part, the hard part is the specifics of the bash syntax)
There is a tonne of more elegant ways to get this done, but I've quickly hacked something together that Works for Me (tm) when regenerating a binary file from a hex dump generated by hexdump -C some_file.bin:
sed 's/\(.\{8\}\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\)/\1: \2\3 \4\5 \6\7 \8\9/g' some_file.hexdump | sed 's/\(.*\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) |/\1 \2\3 \4\5 \6\7 \8\9 /g' | sed 's/.$//g' | xxd -r > some_file.restored
Basically, uses 2 sed processeses, each handling it's part of each line. Ugly, but someone might find it useful.
If you don't have xxd, use hexdump, od, perl or python:
The following all give the same output:
# If you only have hexdump
hexdump -ve '1/1 "%.2x"' mybinaryfile > mydump
# This gives exactly the same output as:
xxd -p mybinaryfile > mydump
# Or, much slower:
od -v -t x1 -An < mybinaryfile | tr -d "\n " > mydump
# Or, the fastest:
perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"' < mybinaryfile > mydump
# Or, if you somehow have Python, and not Perl:
python -c "print(open('mybinaryfile','rb').read().hex())" > mydump
Then you can copy and paste, or pipe the output, and convert back with:
xxd -r -p mydump mybinaryfileagain
# Or
xxd -r -p < mydump > mybinaryfileagain
The hexdump command is available almost everywhere, and is usually part of the default busybox - if it's not linked, you can try running busybox hexdump or busybox xxd.
If xxd is not available to reverse the data, then you can try awk
The old days: Zmodem
In the old days we used to use X/Y/Zmodem which is available in the package lrzsz which can tolerate lossy comms - but it's a bidirectional protocol so the binaries need to be running at the same time and there needs to be bidirectional comms:
# Demo on local machine, using FIFOs
mkfifo /tmp/fifo-in
mkfifo /tmp/fifo-out
sz -b mybinaryfile > /tmp/fifo-out < /tmp/fifo-in
mkdir out; cd out
rz -b < /tmp/fifo-out > /tmp/fifo-in
Luckily, screen supports receiving Zmodem, so if you're in a screen session:
screen
telnet somehost
Then type Ctrl+A and : and then zmodem catch and Enter. Then inside the screen on the remote host, issue:
# sz -b mybinaryfile
Press Enter when you see the string starting with "!!!".
When you see "Transfer Complete", you may want to run reset if you want to continue the terminal session normally.
This program reverses hexdump -C output back to the original data.
Usage:
make
make test
./unhexdump -i inputfile -o outputfile
see https://github.com/zhouzq-thu/unhexdump!
i found more simple solution:
bin2hex
echo -n "abc" | hexdump -ve '1/1 "%02x"'
hex2bin
echo -n "616263" | grep -Eo ".{2}" | sed 's/\(.*\)/\\x\1/' | tr -d '\n' | xargs -0 echo -ne

How to check if a file contains only zeros in a Linux shell?

How to check if a large file contains only zero bytes ('\0') in Linux using a shell command? I can write a small program for this but this seems to be an overkill.
If you're using bash, you can use read -n 1 to exit early if a non-NUL character has been found:
<your_file tr -d '\0' | read -n 1 || echo "All zeroes."
where you substitute the actual filename for your_file.
The "file" /dev/zero returns a sequence of zero bytes on read, so a cmp file /dev/zero should give essentially what you want (reporting the first different byte just beyond the length of file).
If you have Bash,
cmp file <(tr -dc '\000' <file)
If you don't have Bash, the following should be POSIX (but I guess there may be legacy versions of cmp which are not comfortable with reading standard input):
tr -dc '\000' <file | cmp - file
Perhaps more economically, assuming your grep can read arbitrary binary data,
tr -d '\000' <file | grep -q -m 1 ^ || echo All zeros
I suppose you could tweak the last example even further with a dd pipe to truncate any output from tr after one block of data (in case there are very long sequences without newlines), or even down to one byte. Or maybe just force there to be newlines.
tr -d '\000' <file | tr -c '\000' '\n' | grep -q -m 1 ^ || echo All zeros
It won't win a prize for elegance, but:
xxd -p file | grep -qEv '^(00)*$'
xxd -p prints a file in the following way:
23696e636c756465203c6572726e6f2e683e0a23696e636c756465203c73
7464696f2e683e0a23696e636c756465203c7374646c69622e683e0a2369
6e636c756465203c737472696e672e683e0a0a766f696420757361676528
63686172202a70726f676e616d65290a7b0a09667072696e746628737464
6572722c202255736167653a202573203c
So we grep to see if there is a line that is not made completely out of 0's, which means there is a char different to '\0' in the file. If not, the file is made completely out of zero-chars.
(The return code signals which one happened, I assumed you wanted it for a script. If not, tell me and I'll write something else)
EDIT: added -E for grouping and -q to discard output.
Straightforward:
if [ -n $(tr -d '\0000' < file | head -c 1) ]; then
echo a nonzero byte
fi
The tr -d removes all null bytes. If there are any left, the if [ -n sees a nonempty string.
Completely changed my answer based on the reply here
Try
perl -0777ne'print /^\x00+$/ ? "yes" : "no"' file

How to grep non-keyboard charaters?

I'm trying to use grep to get all μs under a directory, unfortunately, μ is not a keyboard character, any ideas?
BTW, for normal keyboard words, I could use
find / -type f -print | xargs grep -inE <search_word> 2>/dev/null
to find out all plain text files that contain the search word.
Would You mind using sed instead of grep?
sed -n '/\xb5/p'
However grep should also work:
grep -P '\xb5'
In Bash, you can use the shell's quoting facilities to pass non-ASCII content. In order to correctly identify the search string, we need to know the encoding of the files you are grepping. If they are in UTF-8, you need a different search string than if they are in ISO-8859-1 or UTF-16.
If your shell's locale agrees with the contents of the file, this should all work undramatically out of the box, but here are a couple of workarounds.
# grep ISO-8859-1 \xB5
grep $'\xB5' file
# grep UTF-8 U+03BC
grep $'\xCE\xBC' file
# grep UTF-16be U+03BC
grep $'\x03\xBC' file
# grep UTF-16le U+03BC
grep $'\xBC\x03' file
Some older versions of grep have a problem with non-ASCII characters; as a workaround, you can also use Perl.
perl -ne 'print if m/\u03BC/' file
You might have to play around with Perl's Unicode facilities to get this to work.

Using iconv to convert from UTF-16LE to UTF-8

Hi I am trying to convert some log files from a Microsoft SQL server, but the files are encoded using UTf-16LE and iconv does not seem to be able to convert them.
I am doing:
iconv -f UTF-16LE -t UTF-8 <filename>
I also tried to delete any carriage returns from the end of the line if there are any, but that did not fix it either. If I save it using gedit that works, but this is not a viable solution since I have hundreds of those files.
EDIT: Please see the new answer for the missing option
I forgot the -o switch!
The final command is :
iconv -f UTF-16LE -t UTF-8 <filename> -o <new-filename>
The command you specified will output to stdout. You can either use the -o parameter, or redirect your output:
with -o:
iconv -f UTF-16LE -t UTF-8 infile -o outfile
with piping:
iconv -f UTF-16LE -t UTF-8 infile > outfile
Both will yield the desired result.
However some versions of iconv (v1 on macOS for example) do not support the -o parameter and you will see that the converted text is echoed to stdout. In that case, use the piping option.

How to print only the hex values from hexdump without the line numbers or the ASCII table? [duplicate]

This question already has answers here:
How to create a hex dump of file containing only the hex characters without spaces in bash?
(9 answers)
Closed 7 years ago.
following Convert decimal to hexadecimal in UNIX shell script
I am trying to print only the hex values from hexdump, i.e. don't print the lines numbers and the ASCII table.
But the following command line doesn't print anything:
hexdump -n 50 -Cs 10 file.bin | awk '{for(i=NF-17; i>2; --i) print $i}'
Using xxd is better for this job:
xxd -p -l 50 -seek 10 file.bin
From man xxd:
xxd - make a hexdump or do the reverse.
-p | -ps | -postscript | -plain
output in postscript continuous hexdump style. Also known as plain hexdump style.
-l len | -len len
stop after writing <len> octets.
-seek offset
When used after -r: revert with <offset> added to file positions found in hexdump.
You can specify the exact format that you want hexdump to use for output, but it's a bit tricky. Here's the default output, minus the file offsets:
hexdump -e '16/1 "%02x " "\n"' file.bin
(To me, it looks like this would produce an extra trailing space at the end
of each line, but for some reason it doesn't.)
As an alternative, consider using xxd -p file.bin.
First of all, remove -C which is emitting the ascii information.
Then you could drop the offset with
hexdump -n 50 -s 10 file.bin | cut -c 9-

Resources