File names look the same but are different after copying - Linux

My file names look the same, but they are not.
I copied many_img/ from Debian1 to OS X, then from OS X to Debian2 (for maintenance purposes), using rsync -a -e ssh at each step to preserve everything.
If I do ls many_img/img1/* I get visually the same output on Debian1 and Debian2:
prévisionnel.jpg
But somehow, ls many_img/img1/* | od -c gives different results:
On Debian1:
0000000 p r 303 251 v i s i o n n e l . j p
0000020 g \n
On Debian2:
0000000 p r e 314 201 v i s i o n n e l . j
0000020 p g \n
Thus my web app on Debian2 cannot match the pictures in the file system with the file names in the database.
I thought maybe I needed to change the file name encoding, but it looks like it's already UTF-8 on every OS:
convmv --notest -f iso-8859-15 -t utf8 many_img/img1/*
Returns:
Skipping, already UTF-8
Is there a command to restore all 40 thousand file names on Debian2 to match Debian1 (without transferring everything again)?
I am not sure whether this is a file name encoding problem or something else.

I finally found the command-line conversion tools I was looking for (thanks @Mark for setting me on the right track!).
OK, I didn't know OS X was encoding file names under the hood with a different UTF-8 normalization form.
It appears OS X uses Unicode Normalization Form D (NFD),
while Linux uses Unicode Normalization Form C (NFC).
The HFS+ file system encodes every single file name character in UTF-16.
Unicode characters are decomposed on OS X versus precomposed on Linux.
é, for instance (LATIN SMALL LETTER E WITH ACUTE), is the single code point U+00E9 on Linux,
and is split into a base letter "e" (U+0065) followed by a combining acute accent (U+0301) in its decomposed form (NFD) on OS X.
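A quick check with Python's standard unicodedata module (just a sketch to illustrate the difference) reproduces exactly the two byte sequences shown by od above:

import unicodedata

nfc = "pr\u00e9visionnel.jpg"            # é as the single precomposed code point U+00E9 (NFC, Linux side)
nfd = unicodedata.normalize("NFD", nfc)  # é as e (U+0065) + combining acute U+0301 (NFD, OS X side)

print(nfc.encode("utf-8"))  # b'pr\xc3\xa9visionnel.jpg'  -> octal 303 251 in od -c
print(nfd.encode("utf-8"))  # b'pre\xcc\x81visionnel.jpg' -> octal 314 201 in od -c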
Now about conversion tools:
This command, executed from Linux, will convert file names from NFD to NFC:
convmv --notest --nfc -f utf8 -t utf8 /path/to/my/file
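If convmv is not at hand, roughly the same NFD-to-NFC renaming can be sketched in Python (an illustration only, not a drop-in replacement; it does not check for name collisions, so try it on a copy first):

import os
import unicodedata

def rename_to_nfc(root):
    # Walk bottom-up so the contents of a directory are renamed before the directory itself.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc != name:
                os.rename(os.path.join(dirpath, name), os.path.join(dirpath, nfc))

rename_to_nfc("many_img")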
This command, executed from OS X, will rsync over ssh with on-the-fly NFD-to-NFC conversion:
rsync -a --iconv=utf-8-mac,utf-8 -e ssh path/to/my/local/directory/* user@destinationip:/remote/path/
I tested the two methods and both work like a charm.
Note:
The --iconv option is only available from rsync 3 onwards, whereas OS X provides an old 2.6.9 version by default, so you'll need to update it first.
Typically, to check and upgrade:
rsync --version
brew install rsync
echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile

The first filename contains the single character é while the second contains a simple e followed by the combining character ́ (COMBINING ACUTE ACCENT). They're both valid Unicode, they're just normalized differently. It appears the OS normalized the filename as it created the file.

Related

How to convert text to UTF-8 encoding within a text file

When I use a text editor to see the actual content, I see
baliÄ<8d>ky 0 b a l i ch k i
and when I use cat to see it, I see
baličky 0 b a l i ch k i
How can I make it show baličky in the text editor as well?
I've tried numerous commands such as iconv -f UTF-8 -t ISO-8859-15, iconv -f ISO-8859-15 -t UTF-8, recode utf8..l9.
None of them work. It's still baliÄ<8d>ky instead of baličky. This is a Czech word. If I do a simple sed substitution (replacing Ä<8d> with č), it works, but I have so many other characters like this that fixing them manually is really tedious at this point.
Any suggestions?
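The symptom (cat shows baličky while the editor shows baliÄ<8d>ky) suggests the file itself is already valid UTF-8 and the editor is simply reading it as Latin-1, so the fix may be the editor's encoding setting rather than the file. A small Python check (a sketch) shows where the Ä<8d> comes from:

text = "baličky"
raw = text.encode("utf-8")     # b'bali\xc4\x8dky' -- the bytes actually stored in the file
print(raw.decode("latin-1"))   # 'baliÄ' + an invisible control char + 'ky' -- the Latin-1 reading the editor shows as baliÄ<8d>ky
print(raw.decode("utf-8"))     # baličky -- what cat shows on a UTF-8 terminal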

Use iconv or python3 to recode utf-8 to Latin-1 (ISO-8859-1) preserving accented characters

By most accounts, one ought to be able to change the encoding of a UTF-8
file to a Latin-1 (ISO-8859-1) encoding by a trivial invocation of iconv such as:
iconv -c -f utf-8 -t ISO-8859-1//TRANSLIT
However, this fails to deal with accented characters properly. Consider
for example:
$ echo $LC_ALL
C
$ cat Gonzalez.txt
González, M.
$ file Gonzalez.txt
Gonzalez.txt: UTF-8 Unicode text
$ iconv -c -f utf-8 -t ISO-8859-1//TRANSLIT < Gonzalez.txt > out
$ file out
out: ASCII text
$ cat out
Gonzalez, M.
I've tried several variations of the above, but none handles the accented "a" properly, even though Latin-1 does have an accented "a".
Indeed, uconv does handle the situation properly:
$ uconv -x Any-Accents -f utf-8 -t l1 < Gonzalez.txt > out
$ file out
out: ISO-8859 text
Opening the file in emacs or
Sublime shows the accented "a" properly. Same thing using -x nfc.
Unfortunately, my target environment does not permit a solution using "uconv",
so I am looking for a simple solution using either iconv or Python3.
python3 attempts
My attempts using python3 so far have not been successful.
For example, the following:
import sys
import fileinput # allows file to be specified or else reads from STDIN

for line in fileinput.input():
    l = line.encode("latin-1", "replace")
    sys.stdout.buffer.write(l)
produces:
Gonza?lez, M.
(That's a literal "?".)
I've tried various other Python3 possibilities, so far without success.
Please note that I've reviewed numerous SO questions on this topic, but the answers using iconv or Python3 do not handle Gonzalez.txt properly.
There are two ways to encode A WITH ACUTE ACCENT in Unicode.
One is to use a combined character, as illustrated here with Python's built-in ascii function:
>>> ascii('á')
"'\\xe1'"
But you can also use a combining accent following an unaccented letter a:
>>> ascii('á')
"'a\\u0301'"
Depending on the displaying application, the two variants may look indistinguishable (in my terminal, the latter looks a bit odd, with the accent being too large).
Now, Latin-1 has an accented letter a, but no combining accents, so that's why the acute becomes a question mark when encoding with errors="replace".
Fortunately, you can automatically switch between the two variants.
Without going into details (there are many details here), Unicode defined two normalization forms, called composed and decomposed, abbreviated NFC and NFD, respectively.
In Python, you can use the standard-library module unicodedata:
>>> import unicodedata as ud
>>> ascii(ud.normalize('NFD', 'á'))
"'a\\u0301'"
>>> ascii(ud.normalize('NFC', 'á'))
"'\\xe1'"
In your specific case, you can convert the input strings to NFC form, which will increase coverage of Latin-1 characters:
>>> n = 'Gonza\u0301lez, M.'
>>> print(n)
González, M.
>>> n.encode('latin1', errors='replace')
b'Gonza?lez, M.'
>>> ud.normalize('NFC', n).encode('latin1', errors='replace')
b'Gonz\xe1lez, M.'
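Putting this together with the fileinput loop from the question, a minimal fixed version of the script could look like this (a sketch: it normalizes each line to NFC before encoding to Latin-1):

import sys
import fileinput
import unicodedata

for line in fileinput.input():
    # Compose combining accents (NFD -> NFC) so Latin-1 has a single code point to map to.
    composed = unicodedata.normalize("NFC", line)
    sys.stdout.buffer.write(composed.encode("latin-1", "replace"))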

How to grep on the content of a zipped non-standard textfile

On my Windows-10 PC, I have installed Ubuntu app. There I'd like to grep on the content of a group of zipfiles, but let's start with just 1 zipfile. My zipfile contains two files: a crashdump and an errorlog (textfile), containing some information. I'm particularly interested in information within that error logfile:
<grep_inside> zipfile.zip "Access violation"
Until now, this is my best result:
unzip -c zipfile.zip error.log
This shows the error logfile, but it shows it as a hexdump, which makes it impossible to launch a grep on it.
As proposed on various websites, I've also tried the following commands: vim, view, zcat, zless and zgrep; none of them works, for different reasons.
Some further investigation
This question is not a duplicate of this post, as suggested; I believe the issue is caused by the encoding of the logfile, as you can see in the following results of other basic Linux commands after unzipping the error logfile:
emacs error.log
... caused an Access Violation (0xc0000005)
cat error.log
. . . c a u s e d a n A c c e s s V i o l a t i o n ( 0 x c 0 0 0 0 0 0 5 )
Apparently the error.log file is not being recognised as a simple textfile:
file error.log
error.log : Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
In this post on grepping non-standard text files, I found the answer:
unzip -c zipfile.zip error.log | grep -a "A.c.c.e.s.s"
Now I have something to start from.
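Alternatively, decoding the UTF-16 instead of matching around the interleaved NUL bytes also works; here is a small Python sketch (assuming the member inside the archive is named error.log, as above):

import zipfile

with zipfile.ZipFile("zipfile.zip") as zf:
    # file(1) reported little-endian UTF-16; "utf-16" consumes a BOM if present
    # (use "utf-16-le" explicitly if the file has no BOM)
    text = zf.read("error.log").decode("utf-16")
    for line in text.splitlines():
        if "Access Violation" in line:
            print(line)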
Thanks, everyone, for your cooperation.

What does "cat -A" command option mean in Unix

I'm currently working on a Unix box and came across this post, which I found helpful for learning about the cat command in Unix. At the bottom of the page I found this line: -A = Equivalent to -vET
As I'm new to Unix, I don't know what this actually means. For example, let's say I've created a file called new using cat and then apply this command to the file:
cat -A new. I tried this command, but an error message comes up saying it's an illegal option.
In short, I'd like to know what cat -A really means and what effect it has when I apply it to a file. Any help would be appreciated.
It means show ALL.
Basically it's a combination of -vET:
E : It will display '$' at the end of every line.
T : It will display tab character as ^I
v : It will use ^ and M-notation
^ and M- notation:
(Display control characters, except for LFD (line feed, i.e. newline) and TAB, using '^' notation, and precede characters that have the high bit set with 'M-'.)
M- notation is a way to display high-bit characters as their low-bit counterparts preceded by M-; the prefix comes from the old "meta" key convention for characters whose eighth (high) bit is set.
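As a rough illustration of the convention (a sketch of the notation, not of how GNU cat is implemented), here is how the ^ and M- forms map onto byte values:

def caret_notation(byte):
    # Render one byte the way cat -v style output does (TAB and newline handling left out here).
    if byte >= 0x80:                  # high bit set: strip it and add the M- prefix
        return "M-" + caret_notation(byte & 0x7F)
    if byte == 0x7F:                  # DEL is shown as ^?
        return "^?"
    if byte < 0x20:                   # other control characters: ^ plus the character 64 code points up
        return "^" + chr(byte + 0x40)
    return chr(byte)                  # ordinary printable character

print(caret_notation(0x09))   # ^I   (a TAB, which the -T option displays this way)
print(caret_notation(0x8D))   # M-^M
print(caret_notation(0xE9))   # M-i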
For example, here is a screenshot of sample output: http://i.imgur.com/0DGET5k.png?1
Check your manual page as shown below; it will list all the options available for your cat command, and you can check whether -A is present. If it is not, then it is an illegal option on your system:
man cat
It displays non-printing characters.
On Mac OS you need to use the -e flag instead:
-e Display non-printing characters (see the -v option), and display a dollar sign (`$') at the end of each line.

"grep" offset of ascii string from binary file

I'm generating binary data files that are simply a series of records concatenated together. Each record consists of a (binary) header followed by binary data. Within the binary header is an ascii string 80 characters long. Somewhere along the way, my process of writing the files got a little messed up and I'm trying to debug this problem by inspecting how long each record actually is.
This seems extremely related, but I don't understand perl, so I haven't been able to get the accepted answer there to work. The other answer points to bgrep which I've compiled, but it wants me to feed it a hex string and I'd rather just have a tool where I can give it the ascii string and it will find it in the binary data, print the string and the byte offset where it was found.
In other words, I'm looking for some tool which acts like this:
tool foobar filename
or
tool foobar < filename
and its output is something like this:
foobar:10
foobar:410
foobar:810
foobar:1210
...
i.e., the string that matched and the byte offset in the file where the match started. In this example, I can infer that each record is 400 bytes long.
Other constraints:
ability to search by regex is cool, but I don't need it for this problem
My binary files are big (3.5 GB), so I'd like to avoid reading the whole file into memory if possible.
grep --byte-offset --only-matching --text foobar filename
The --byte-offset option prints the offset of each matching line.
The --only-matching option makes it print offset for each matching instance instead of each matching line.
The --text option makes grep treat the binary file as a text file.
You can shorten it to:
grep -oba foobar filename
It works in the GNU version of grep, which Linux distributions ship by default. It won't work in BSD grep (which macOS ships by default).
You could use strings for this:
strings -a -t x filename | grep foobar
Tested with GNU binutils. Note that -t x prints the offset in hexadecimal; use -t d for decimal offsets.
For example, where in /bin/ls does --help occur:
strings -a -t x /bin/ls | grep -- --help
Output:
14938 Try `%s --help' for more information.
162f0 --help display this help and exit
I wanted to do the same task. Though strings | grep worked, I found gsar was the very tool I needed.
http://tjaberg.com/
The output looks like:
>gsar.exe -bic -sfoobar filename.bin
filename.bin: 0x34b5: AAA foobar BBB
filename.bin: 0x56a0: foobar DDD
filename.bin: 2 matches found
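If neither GNU grep nor gsar is available, a short Python sketch can produce the same kind of string:offset output while streaming the file in chunks, so a 3.5 GB file never has to be read into memory at once:

import sys

def find_offsets(path, needle, chunk_size=1 << 20):
    # Yield the byte offset of every occurrence of needle (a bytes object) in the file.
    overlap = len(needle) - 1   # keep this many trailing bytes so matches spanning chunks are found
    offset = 0                  # file offset of the first byte currently held in buf
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            pos = buf.find(needle)
            while pos != -1:
                yield offset + pos
                pos = buf.find(needle, pos + 1)
            if len(buf) > overlap:
                offset += len(buf) - overlap
                buf = buf[-overlap:] if overlap else b""

if __name__ == "__main__":
    pattern, filename = sys.argv[1], sys.argv[2]
    for off in find_offsets(filename, pattern.encode()):
        print(f"{pattern}:{off}")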
