Finding the encoding of text files - linux

I have a bunch of text files with different encodings, and I want to convert all of them to UTF-8. Since there are about 1000 files, I can't do it manually. I know there are commands on Linux that change the encoding of a file from one encoding to another, but my question is: how do I automatically detect the current encoding of a file? Clearly I'm looking for a command (say, FindEncoding($File)) to do this:
foreach file
do
    $encoding=FindEncoding($file);
    uconv -f $encoding -t utf-8 $file;
done

I usually do something like this:
for f in *.txt; do
    encoding=$(file -i "$f" | sed "s/.*charset=\(.*\)$/\1/")
    recode "$encoding..utf-8" "$f"
done
Note that recode will overwrite the file when changing its character encoding.
If it is not possible to identify the text files by extension, their respective mime type can be determined with file -bi | cut -d ';' -f 1.
It is also probably a good idea to avoid unnecessary re-encoding by checking for UTF-8 first:
if [ "$encoding" != "utf-8" ]; then
    # re-encode here
fi
After this treatment, there might still be some files with a us-ascii encoding. The reason is that ASCII is a subset of UTF-8, which remains in use unless characters are introduced that ASCII cannot express; in that case, the encoding switches to UTF-8.
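Putting those pieces together, a minimal sketch of the whole loop (untested; it assumes GNU file and recode are available and that the charset guessed by file -bi is accurate enough, so try it on copies first):
for f in *.txt; do
    # file -bi prints e.g. "text/plain; charset=iso-8859-1"
    encoding=$(file -bi "$f" | sed 's/.*charset=//')
    # us-ascii is already valid UTF-8, so only re-encode everything else
    if [ "$encoding" != "utf-8" ] && [ "$encoding" != "us-ascii" ]; then
        recode "$encoding..utf-8" "$f"
    fi
done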

Related

How to print only txt files on linux terminal?

In a directory on my Linux system I have 6 files: 5 are text files and 1 is a .tar.gz archive. How can I print only the names of the text files to the terminal?
directory: dir
content:
ex1, ex2, ex3, ex4, ex5, ex6.tar.gz
Because you do not have a file extension (.txt), I would try to do it by exclusion:
ls | grep -v tar.gz
If you have multiple types then use extensions.
The command 'file', followed by the name of a file, will return the type of the file.
You can loop over the files in your directory, use each filename as input to the 'file' command, and if it is a text file, print that filename.
The following includes some extra output from the file command, which I'm not sure how to remove yet, but it does give you the filenames you want:
#!/bin/bash
for f in *
do
    file "$f" | grep text
done
You can put this into a shell script in the directory you want to get the filenames from, and run it from the command line.
The suggestions to use the file command are correct. The problem here is parsing the output of that command, because (1) file names can contain almost any character, and (2) the exact output of the file command is a bit unpredictable, since it depends on which so-called magic files are present.
If we rely on the fact that the explanation text in the output of the file command - i.e. the part which explains what kind of file it is - always contains the word text when it is a text file, and that it never contains a colon, we can process it as follows:
The last colon in the output therefore separates the filename from the explanation. Everything to the left is the filename, and if the word text (note the leading space before text!) occurs in the right part, we have a text file.
This still leaves us with those (hopefully rare) cases where a file name contains a non-printable character; such characters would be translated to their octal equivalent, which might or might not be what you want to see. You can suppress this by passing the -r option to the file command. That is useful if you want to process the filename further instead of just displaying it to the user, but it might corrupt your parsing logic, especially if the filename contains a newline.
Finally, don't forget that in any case, you see what the system considers a text file. This is not necessarily the same what you define to be a text file.
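A minimal sketch of that parsing in plain Bash (assuming, as described above, that the explanation part never contains a colon):
for f in *; do
    [ -f "$f" ] || continue
    line=$(file "$f")
    name=${line%:*}       # everything left of the last colon: the filename
    desc=${line##*:}      # everything right of the last colon: the explanation
    case "$desc" in
        *" text"*) printf '%s\n' "$name" ;;   # note the leading space before "text"
    esac
done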
Updated Answer
As @hek2mgl points out in the comments, a more robust solution is to separate the filenames with NUL characters (which cannot occur in filenames); that deals with filenames containing newlines and colons:
file -0 * | awk -F'\0' '$2 ~ /text/{print $1}'
Original Answer
I would do this:
file * | awk -F: '$2~/text/{print $1}'
That runs file to see the type of each file and passes the names and types to awk separated by a colon. awk then looks for the word text in the second field and if it finds it, prints the first field - which is the filename.
Try running the following simpler command on its own to see how it works:
file *
Given this directory of files:
$ file *
1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
2.pdf: PDF document, version 1.5
3.pdf: PDF document, version 1.5
4.dat: data
5.txt: ASCII text
6.jpg: JPEG image data, JFIF standard 1.02, aspect ratio, density 100x100, segment length 16, baseline, precision 8, 2833x972, frames 3
7.html: HTML document text, UTF-8 Unicode text, with very long lines, with no line terminators
8.js: UTF-8 Unicode text
9.xml: XML 1.0 document text
A.pl: a /opt/local/bin/perl script text executable, ASCII text
B.Makefile: makefile script text, ASCII text
C.c: c program text, ASCII text
D.docx: Microsoft Word 2007+
You can see that the only files that are pure ASCII are 5.txt, 9.xml, and A-C. The rest are either binary or UTF according to file.
You can use a Bash glob to loop through the files and use file to test each one. This saves having to parse the output of file for the filenames, but relies on file accurately identifying what you consider to be 'text':
for fn in *; do
    [ -f "$fn" ] || continue
    fo=$(file "$fn")
    [[ $fo =~ ^"$fn":.*text ]] || continue
    echo "$fn"
done
If you cannot use file, which is certainly the easiest way, you can open the file and look for binary characters. Use Perl for that:
for fn in *; do
    [ -f "$fn" ] || continue
    # exit non-zero if more than 3% of the first 2000 bytes are non-ASCII
    head -c 2000 "$fn" | perl -lne '$tot+=length; $cnt+=s/[^[:ascii:]]//g; END{exit 1 if($cnt/$tot>0.03);}'
    [ $? -eq 0 ] || continue
    echo "$fn"
done
In this case, I am looking at the percentage of ASCII vs non-ASCII characters in the first 2000 bytes of each file. YMMV, but that allows finding a file that file would report as UTF (since it has a BOM) even though most of the file is ASCII.
For that directory, the two Bash scripts report (with my comments on each file):
1.txt # UTF file with a binary BOM but no UTF characters -- all ascii
4.dat # text based configuration file for a router. file does not report this
5.txt # Pure ascii file
7.html # html file
8.js # Javascript sourcecode
9.xml # xml file all text
A.pl # Perl file
B.Makefile # Unix make file
C.c # C source file
Since file does not consider the all-ASCII file 4.dat to be text, it is not reported by the first Bash script but is by the second. Otherwise, the output is the same.

How to replace a string containing "\u2015"?

Does anyone know how to replace a string containing \u2015 in a sed command, like the example below?
sed -ie "s/some text \u2015 some more text/new text/" inputFileName
You just need to escape the backslash. The example below works fine in GNU sed version 4.2.1:
$ echo "some text \u2015 some more text" | sed -e "s/some text \\\u2015 some more text/abc/"
abc
Also, you don't have to use the -i flag, which according to the man page is only for editing files in place:
-i[SUFFIX], --in-place[=SUFFIX]
       edit files in place (makes backup if extension supplied). The default operation mode is to break symbolic and hard links. This can be changed with --follow-symlinks and --copy.
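If the file actually contains the horizontal bar character U+2015 rather than the six literal characters \u2015, a sketch of another approach is to let the shell expand the code point and hand the resulting UTF-8 character to sed (this assumes Bash 4.2+ and a UTF-8 locale):
# $'\u2015' is expanded by Bash to the UTF-8 encoding of U+2015,
# so sed sees the real character, not the escape sequence.
bar=$'\u2015'
sed -e "s/some text ${bar} some more text/new text/" inputFileName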
Not sure if this is exactly what you need, but maybe you should take a look at the native2ascii tool to convert such Unicode escapes.
Normally it replaces all characters that cannot be displayed in ISO-8859-1 with their Unicode escapes (prefixed with \u), but it also supports the reverse conversion. Assuming you have a UTF-8 file named "input" containing \u00abSome \u2015 string\u00bb, then executing
native2ascii -encoding UTF-8 -reverse input output
will result in an "output" file containing «Some ― string».

Change all non-ascii chars to ascii Bash Scripting

I am trying to write a script that takes people's names as arguments and creates a folder for each name. But in folder names, non-ASCII characters and whitespace can sometimes cause problems, so I want to remove them or change them to ASCII characters.
I can remove the whitespace between name and surname, but I cannot figure out how to change ş->s, ç->c, ğ->g, ı->i, ö->o.
Here is my code :
#!/bin/bash
ARRAY=("$@")
ELEMENTS=${#ARRAY[@]}
for (( i=0; i<$ELEMENTS; i++ ))
do  # C-like for loop syntax
    echo "${ARRAY[$i]}" | grep "[^ ]*\b" | tr -d ' '
done
I run my script like this: myscript.sh 'Çişil Aksoy' 'Cem Dalgıç'
It should change the arguments to: CisilAksoy CemDalgic
Thanks in advance
EDIT :
I found this solution; it does not look very pretty, but it works:
sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;'
EDIT2 : SOLVED
#!/bin/bash
ARRAY=("$@")
ELEMENTS=${#ARRAY[@]}
for (( i=0; i<$ELEMENTS; i++ ))
do  # C-like for loop syntax
    v=$(echo "${ARRAY[$i]}" | grep "[^ ]*\b" | tr -d ' ' | sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;')
    mkdir "$v"
done
Anything that converts from UTF-8 to ASCII is going to be a compromise.
The iconv program does what was requested (not necessarily satisfying everyone, as in Transliterate any convertible utf8 char into ascii equivalent). Given
Çişil Aksoy' 'Cem Dalgıç
in "foo.txt", and the command
iconv -f UTF8 -t ASCII//TRANSLIT <foo.txt
that would give
Cisil Aksoy' 'Cem Dalg?c
The lynx browser has a different set of ASCII approximations. Using this command
lynx -display_charset=us-ascii -force_html -nolist -dump foo.txt
I get this result:
C,isil Aksoy' 'Cem Dalgic,
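To connect this back to the original folder-creation question, a rough sketch using iconv transliteration (it assumes a UTF-8 locale and GNU iconv; characters iconv cannot transliterate come out as ?, as with the ı above, so you may still want the sed substitutions from the question's edit for those):
#!/bin/bash
for name in "$@"; do
    # Transliterate to ASCII, then drop spaces and any leftover '?' marks.
    ascii=$(printf '%s' "$name" | iconv -f UTF-8 -t ASCII//TRANSLIT | tr -d ' ?')
    mkdir -p "$ascii"
done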
Simply put, you can't. ASCII only supports 128 characters.
International characters typically use some variation of Unicode, which can store a much much greater number of characters.
I think your best bet is to identify WHY your folder creation fails when using these characters. Does the method or function not support Unicode? If it does, figure out how to specify that instead of ASCII. If not, you might be stuck with sed and/or tr, which is probably not sustainable.
[UPDATED]
You should be able to substitute multiple characters via tr as follows:
echo şğıö | tr şçğıö scgio
sgio
(I removed my comment from earlier. I tried it on a different server and it worked fine.)

Linux script to automatically convert file type to UTF8

I am in a tight spot and could use some help coming up with a Linux shell script to convert a directory full of pipe-delimited files from their original encodings to UTF-8. The source files are either US-ASCII or ISO-8859-1 encoded. The closest thing I could come up with is:
iconv -f ISO8859-1 -t utf-8 * > name_of_utf8_file
This condenses all of the files into a single file, which is not needed but OK for this application. The problem is that I need to specify the source file encoding, and for half of the files I don't know what it is. Is there a way to write a shell script using commands like file -i or the like?
Any advice here is much appreciated.
This is (not properly tested, caveat emptor :)) one way of doing it. Maybe try with a small subset first - this is more of a thought example than a turn-key solution.
for i in *
do
    if file -i "${i}" | grep -q us-ascii; then
        iconv -f us-ascii -t utf-8 "$i" > "${i}.utf8"
    fi
    if file -i "${i}" | grep -q iso-8859-1; then
        iconv -f iso8859-1 -t utf-8 "$i" > "${i}.utf8"
    fi
done
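A more general sketch that pulls the charset out of file -bi instead of hard-coding the two cases (untested outline; it assumes GNU file and iconv, and that file's charset guess is reliable enough for your data):
for i in *; do
    [ -f "$i" ] || continue
    charset=$(file -bi "$i" | sed -n 's/.*charset=//p')
    # skip anything file does not recognise as text with a usable charset
    case "$charset" in
        binary|unknown-8bit|'') continue ;;
    esac
    iconv -f "$charset" -t UTF-8 "$i" > "${i}.utf8"
done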

Find and replace Non breaking space characters in Bash

I have a document with some special characters like non-breaking space, non-breaking hyphen, and so on. I want to normalize this document and replace these special characters with space. In addition since the content of this document is gathered from different resources, I have different forms of "Yeh" (ی) in it, and I want to normalize them.
Is it possible to find and replace Unicode characters in a document using the sed command? Can I use Unicode code points instead of the surface form of the character? For example, can I use \x00a0 instead of the non-breaking space in the sed command? How?
Sorry for the bad explanation.
My documents are encoded in UTF-8 and contain non-English characters. For example, I have a document in Arabic, a document in Urdu, and one in Persian (Farsi). Now I want to replace some of the characters in these files with other characters.
By normalizing, I mean that I want to replace all forms of "Yeh" with one form. (As you might know, there are many forms of this character used in Arabic, but for simplification and some processing issues I want to unify all these forms.)
To process UTF-8 files character by character, you have to parse each character from beginning to end. If you need to do it efficiently, you have to write a real program rather than trying to script a solution.
If you just want to script it, it is easier to convert to UTF-16 and then process the characters.
A fairly inefficient way would be:
#!/bin/bash
# px: print the bytes encoded by a string of hex digit pairs
function px {
    local a="$@"
    local i=0
    while [ $i -lt ${#a} ]
    do
        printf \\x${a:$i:2}
        i=$(($i+2))
    done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
    px $utf16header
    out=''
    while read line
    do
        if [ "$line" == "000a" ]
        then
            out=$out$line
            px $out
            out=''
        else
            # put your conversion logic here.
            # e.g.
            # if [ "$line" == "0031" ] ; then
            #     line="0041"
            # fi
            out=$out$line
        fi
    done
fi | iconv -f UTF16 -t UTF8
This might work for you (GNU sed):
echo abcd | sed 'p;y/\x61\x62\x63/ABC/'
abcd
ABCd
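Building on the same idea, a sketch for the original question; the \xHH escapes are a GNU sed extension, and the byte values assume the documents are UTF-8 encoded (U+00A0 non-breaking space is C2 A0, Arabic Yeh U+064A is D9 8A, Farsi Yeh U+06CC is DB 8C):
# Replace non-breaking spaces with plain spaces and normalize Arabic Yeh
# to the Farsi form; -i edits in place, so work on a copy first.
sed -i -e 's/\xc2\xa0/ /g' -e 's/\xd9\x8a/\xdb\x8c/g' yourfile.txt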
