How to convert ISO8859-15 to UTF8? - linux

I have an Arabic file encoded in ISO8859-15. How can I convert it into UTF8?
I used iconv but it doesn't work for me.
iconv -f ISO-8859-15 -t UTF-8 Myfile.txt
I wanted to attach the file, but I don't know how.

Could it be that your file is not ISO-8859-15 encoded? You should be able to check with the file command:
file YourFile.txt
Also, you can use iconv without providing the encoding of the original file:
iconv -t UTF-8 YourFile.txt
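For example, a check-then-convert sequence might look like this (a sketch; it assumes file reports charset=iso-8859-1, and the output filename is arbitrary):
file -i Myfile.txt
iconv -f ISO-8859-1 -t UTF-8 Myfile.txt > Myfile.utf8.txt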

I found this to work for me:
iconv -f ISO-8859-14 Agreement.txt -t UTF-8 -o agreement.txt

I have Ubuntu 14 and the other answers were not working for me.
iconv -f ISO-8859-1 -t UTF-8 in.tex > out.tex
I found this command here

We had this problem and solved it as follows.
Create a script file called to-utf8.sh
#!/bin/bash
TO="UTF-8"; FILE=$1
# file -i prints e.g. "text/plain; charset=iso-8859-1"; keep what follows "="
FROM=$(file -i "$FILE" | cut -d'=' -f2)
if [[ $FROM = "binary" ]]; then
echo "Skipping binary $FILE..."
exit 0
fi
iconv -f "$FROM" -t "$TO" -o "$FILE.tmp" "$FILE"; ERROR=$?
if [[ $ERROR -eq 0 ]]; then
echo "Converting $FILE..."
mv -f "$FILE.tmp" "$FILE"
else
echo "Error on $FILE"
fi
Set the executable bit
chmod +x to-utf8.sh
Do a conversion
./to-utf8.sh MyFile.txt
If you want to convert all files under a folder, do
find /your/folder/here | xargs -n 1 ./to-utf8.sh
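If the filenames may contain spaces or other special characters, a null-delimited variant of the same command is safer (a sketch):
find /your/folder/here -type f -print0 | xargs -0 -n 1 ./to-utf8.sh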
Hope it helps.

I had the same problem, but I found the answer on this page! It worked for me; you can try it.
iconv -f cp936 -t utf-8 

In my case, the file command reported a wrong encoding, so I tried converting with all the possible encodings and found the right one.
Execute this script and check the result file:
for i in `iconv -l`
do
echo $i
iconv -f $i -t UTF-8 yourfile | grep "hint to tell converted success or not"
done &>/tmp/converted

You can use ISO-8859-9 encoding:
iconv -f ISO-8859-9 Agreement.txt -t UTF-8 -o agreement.txt

iconv just writes the converted text to stdout. You have to pass -o OUTPUTFILE.txt as a parameter or redirect stdout to a file (iconv -f x -t z filename.txt > OUTPUTFILE.txt, or iconv -f x -t z < filename.txt > OUTPUTFILE.txt in some iconv versions).
Synopsis
iconv -f encoding -t encoding inputfile
Description
The iconv program converts the encoding of characters in inputfile from one coded character set to another.
The result is written to standard output unless otherwise specified by the --output option.
--from-code, -f encoding
Convert characters from encoding
--to-code, -t encoding
Convert characters to encoding
--list
List known coded character sets
--output, -o file
Specify output file (instead of stdout)
--verbose
Print progress information.
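Applied to the original question's file, either form works (the output filename here is arbitrary):
iconv -f ISO-8859-15 -t UTF-8 Myfile.txt -o Myfile.utf8.txt
iconv -f ISO-8859-15 -t UTF-8 Myfile.txt > Myfile.utf8.txt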

Related

Bash script to convert all files of a given type from Unix to Dos Format

I need to be able to go back and forth between CentOS and Windows for C++ projects that I am developing on both systems. I want to be able to convert all files of a given type such as *.cpp, *.h or *.txt from Unix to Dos.
I have a bash script that converts a single file, but I'd like to be able to type in either just the extension of the file type or *.FILETYPE to convert all the files in a given directory of that type.
Here is the working script. What do I need to do to make it work for more than one file at a time? I've tried using a for loop but I get syntax errors.
#! /bin/sh
# Convert a text based file from Unix Format to Dos Format
# The Output file should be the name of the input file with
# WF inserted before the file extension
#
# Examples
# convert Unix format abc.txt to abcWF.txt Windows Format
# convert Unix format abc.cpp to abcWF.cpp Windows Format
echo "In u2w FileToConvert = $1"
FileBase=$(basename $1)
FileExtension="${FileBase##*.}"
FileBase="${FileBase%.*}"
OutputSuffix="WF.$FileExtension"
OutputFile="$FileBase$OutputSuffix"
commandment="unix2dos -n $1 $OutputFile"
echo $commandment
unix2dos -n $1 $OutputFile
echo "diff -w $1 $OutputFile"
diff -w "$1" "$OutputFile"
exit
Example run
$ bd2w TestResults.txt
In u2w FileToConvert = TestResults.txt
unix2dos -n TestResults.txt TestResultsWF.txt
unix2dos: converting file TestResults.txt to file TestResultsWF.txt in DOS format ...
diff -w TestResults.txt TestResultsWF.txt
Wrap the script in a for file in "$@" loop and replace all mentions of $1 with $file.
#!/bin/sh
for file in "$@"; do
echo "In u2w FileToConvert = $file"
FileBase=$(basename "$file")
FileExtension="${FileBase##*.}"
FileBase="${FileBase%.*}"
OutputSuffix="WF.$FileExtension"
OutputFile="$FileBase$OutputSuffix"
echo "unix2dos -n $file $OutputFile"
unix2dos -n "$file" "$OutputFile"
echo "diff -w $file $OutputFile"
diff -w "$file" "$OutputFile"
done
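Invoked with a glob, the shell expands it into one argument per file, so all three patterns from the question can be converted in one call (the script name u2w.sh is hypothetical):
./u2w.sh *.cpp *.h *.txt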

Bash, wget remove comma from output filename

I am reading a file with URLs line by line and then passing each URL to wget:
FILE=/home/img-url.txt
while read line; do
url=$line
wget -N -P /home/img/ $url
done < $FILE
This works, but some filenames contain a comma. How can I save the file without the comma?
Example:
http://xy.com/0005.jpg -> saved as 0005.jpg
http://xy.com/0022,22.jpg -> saved as 002222.jpg, not as 0022,22.jpg
I hope you find my question interesting.
UPDATE:
We have some nice solutions, but is there any solution to the timestamping error?
WARNING: timestamping does nothing in combination with -O. See the manual
for details.
This should work:
url="$line"
filename="${url##*/}"
filename="${filename//,/}"
wget -P /home/img/ "$url" -O "$filename"
Using -N and -O together will throw a warning message. The wget manual says:
-N (for timestamp-checking) is not supported in
combination with -O: since file is always newly created, it will always
have a very new timestamp.
So, when you use the -O option, it actually creates a new file with a new timestamp, and thus the -N option becomes a dummy (it can't do what it is for). If you want to preserve the timestamping, then a workaround might be this:
url="$line"
wget -N -P /home/img/ "$url"
file="${url##*/}"
newfile="${file//,/}"
[[ $file != $newfile ]] && cp -p /home/img/"$file" /home/img/"$newfile" && rm /home/img/"$file"
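Since mv preserves timestamps just like cp -p, the copy-and-remove pair could arguably be collapsed into a single rename (a sketch):
[[ $file != $newfile ]] && mv -f /home/img/"$file" /home/img/"$newfile"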
In the body of the loop you need to generate the filename from the URL, without commas and without the leading part of the URL, and tell wget to save under that name.
url=$line
file=`echo $url | sed -e 's|^.*/||' -e 's/,//g'`
wget -N -P /home/image/dema-ktlg/ -O $file $url
In the meantime I wrote this:
url=$line
$file=`echo ${url##*/} | sed 's/,//'`
wget -N -P /home/image/dema-ktlg/ -O $file $url
Seems to work fine, is there any trivial problem with my code?

Shell file parsing to automate youtube uploads

I was writing a small script to mass upload videos to YouTube. It looks like this:
#! /bin/sh
python --version
for file in ./*.mp4 ; do
export title=$(basename "$file" ".mp4")
echo $title "for" $file
youtube-upload -m mail@mailer.com -p pass -c Category -t "$title" -d "description" "$file"
done
Where youtube-upload is a CLI-based Python uploader.
My question: as you can see, I'm taking the title from the filename, and the description is always the same :(. I want to write the descriptions into a text or XML file, parse it, and then upload with a proper description for each file.
How can I do this using shell commands in this context?
Thanks for your suggestions.
Have one file containing all of your filename->description mappings:
my file.mp4 this is a description
file2.mp4 this is another description
my last file.mp4 this is a description
ilied.mp4 description
And then just grab the line beginning with the filename and use the rest of the line:
#! /bin/sh
python --version
for file in ./*.mp4 ; do
export title=$(basename "$file" ".mp4")
echo $title "for" $file
youtube-upload -m mail@mailer.com -p pass -c Category -t "$title" -d "$(grep "^$(basename "$file")" desc | sed 's/.*.mp4 //')" "$file"
done
Looking at what's in the $(...):
grep "^$(basename "$file")" desc | sed 's/.*.mp4 //'
The grep finds the line that starts with the basename($file) in the "desc" file and then uses sed to get rid of the filename, leaving the rest of the line (which would be the description).
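Note that the dot before mp4 in that sed pattern is an unescaped regex metacharacter, so it matches any character; escaping it and anchoring the pattern is slightly safer (a sketch):
grep "^$(basename "$file")" desc | sed 's/^.*\.mp4 //'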
Note that this doesn't need to be a part of the youtube-upload command line. You can also toss it into a variable:
#! /bin/sh
python --version
for file in ./*.mp4 ; do
export title=$(basename "$file" ".mp4")
echo $title "for" $file
description=$(grep "^$(basename "$file")" desc | sed 's/.*.mp4 //')
youtube-upload -m mail@mailer.com -p pass -c Category -t "$title" -d "$description" "$file"
done

iconv any encoding to UTF-8

I am trying to point iconv at a directory so that all files are converted to UTF-8 regardless of their current encoding.
I am using this script, but you have to specify which encoding you are going FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary
if [ $# -lt 3 ]
then
echo "$0 dir from_charset to_charset"
exit
fi
for f in $1/*
do
if test -f $f
then
echo -e "\nConverting $f"
/bin/mv $f $f.old
$ICONVBIN -f $2 -t $3 $f.old > $f
else
echo -e "\nSkipping $f - not a regular file";
fi
done
terminal line
sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently of language.
Note that in general, autodetection of current encoding is a difficult process (the same byte sequence can be correct text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of encodings). You can use enconv to convert text files to a single encoding.
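For example (a sketch; the language code and filename are placeholders), detection and in-place conversion to UTF-8 might look like:
enca -L czech myfile.txt
enca -L czech -x UTF-8 myfile.txt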
You can get what you need using the standard GNU utils file and awk. Example:
file -bi .xsession-errors
gives me:
"text/plain; charset=us-ascii"
so file -bi .xsession-errors |awk -F "=" '{print $2}'
gives me
"us-ascii"
I use it in scripts like so:
CHARSET="$(file -bi "$i"|awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
iconv -f "$CHARSET" -t utf8 "$i" -o outfile
fi
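With a reasonably recent version of file, the --mime-encoding option prints just the charset, which avoids the awk step (a sketch):
CHARSET="$(file -b --mime-encoding "$i")"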
Combining them all: go to the directory and create dir2utf8.sh:
#!/bin/bash
# converting all files in a dir to utf8
for f in *
do
if test -f "$f"; then
echo -e "\nConverting $f"
CHARSET="$(file -bi "$f"|awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
# write to a temp file first; iconv -o with the same input and
# output file can truncate the input before it is fully read
iconv -f "$CHARSET" -t utf8 "$f" -o "$f.utf8" && mv -f "$f.utf8" "$f"
fi
else
echo -e "\nSkipping $f - not a regular file";
fi
done
Here is my solution to convert all files in place using recode and uchardet:
#!/bin/bash
apt-get -y install recode uchardet > /dev/null
find "$1" -type f | while read FFN # 'dir' should be changed...
do
encoding=$(uchardet "$FFN")
echo "$FFN: $encoding"
enc=`echo $encoding | sed 's#^x-mac-#mac#'`
set +x
recode $enc..UTF-8 "$FFN"
done
put it into convert-dir-to-utf8.sh and run:
bash convert-dir-to-utf8.sh /path/to/my/trash/dir
Note that sed is a workaround for mac encodings here.
Many uncommon encodings need workarounds like this.
First answer
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d $'\0' LINE_FILE; do
CHARSET=$(uchardet "$LINE_FILE")
echo "Converting ($CHARSET) $LINE_FILE"
# NOTE: Convert/reconvert to utf8. By Questor
iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"
# NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
# [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
# https://stackoverflow.com/a/45240995/3223785 ]
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files are not correctly converted (characters are lost) or are "truncated". I suspect that this has to do with the iconv tool or with the charset information obtained with the uchardet tool. I was curious about the solution presented by @demofly because it could be safer.
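A plausible cause of the truncation (an assumption, not verified here): iconv -o with the same file as input and output truncates the output file before the input has been fully read. Converting to a temporary file and renaming it only on success avoids this (a sketch):
iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE.tmp" && mv -f "$LINE_FILE.tmp" "$LINE_FILE"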
Another answer
Based on @demofly's answer:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d $'\0' LINE_FILE; do
CHARSET=$(uchardet "$LINE_FILE")
REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
echo "\"$CHARSET\" \"$LINE_FILE\""
# NOTE: Convert/reconvert to utf8. By Questor
recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
STDERR_OP=$(cat STDERR_OP)
rm -f STDERR_OP
if [ -n "$STDERR_OP" ] ; then
# NOTE: Convert/reconvert to utf8. By Questor
iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
STDERR_OP=$(cat STDERR_OP)
rm -f STDERR_OP
fi
# NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
# [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
# https://stackoverflow.com/a/45240995/3223785 ]
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
if [ -n "$STDERR_OP" ] ; then
echo "ERROR: \"$STDERR_OP\""
fi
STDOUT_OP=$(cat STDOUT_OP)
rm -f STDOUT_OP
if [ -n "$STDOUT_OP" ] ; then
echo "RESULT: \"$STDOUT_OP\""
fi
done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
Third answer
Hybrid solution with recode and vim:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d $'\0' LINE_FILE; do
CHARSET=$(uchardet "$LINE_FILE")
REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
echo "\"$CHARSET\" \"$LINE_FILE\""
# NOTE: Convert/reconvert to utf8. By Questor
recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
STDERR_OP=$(cat STDERR_OP)
rm -f STDERR_OP
if [ -n "$STDERR_OP" ] ; then
# NOTE: Convert/reconvert to utf8. By Questor
bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
else
# NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
# [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
# https://stackoverflow.com/a/45240995/3223785 ]
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
fi
done
This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.
WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can itself introduce "differences", so it is best executed only after a preliminary comparison, in the merge tool, with a conversion done without it.
NOTE: The search using find brings in all non-binary files from the given path (<YOUR_FOLDER_PATH>) and its subfolders.
Check out the tools available for data conversion in a Linux CLI: https://www.debian.org/doc/manuals/debian-reference/ch11.en.html
Also, to find the full list of encodings available in iconv, just run iconv --list; note that the encoding names differ from the names returned by the uchardet tool (for example: x-mac-cyrillic in uchardet vs. mac-cyrillic in iconv).
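A quick way to test whether a name reported by uchardet is accepted by iconv is to convert empty input, since iconv fails immediately on an unknown encoding name (a sketch; somefile.txt is a placeholder):
enc=$(uchardet somefile.txt)
if iconv -f "$enc" -t UTF-8 </dev/null >/dev/null 2>&1; then
echo "iconv accepts $enc"
else
echo "iconv does not know $enc"
fi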
The enca command doesn't work for my Simplified Chinese text file with GB2312 encoding.
Instead, I use the following function to convert the text file for me.
You could of course redirect the output into a file.
It requires the chardet and iconv commands.
detection_cat ()
{
DET_OUT=$(chardet "$1");
ENC=$(echo "$DET_OUT" | sed "s|^.*: \(.*\) (confid.*$|\1|");
iconv -f "$ENC" "$1"
}
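Usage might look like this (the filename is a placeholder), redirecting the output into a new file:
detection_cat gb2312_file.txt > gb2312_file.utf8.txt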
Use iconv and uchardet (thanks to farseerfc).
fish shell
cat your_file | iconv -f (uchardet your_file ) -t UTF-8
bash shell
cat your_file | iconv -f $(uchardet your_file ) -t UTF-8
or in a bash script:
#!/usr/bin/bash
for fn in "$@"
do
iconv < "$fn" -f $(uchardet "$fn") -t utf8
done
by @flowinglight at the Ubuntu group.

Why does iconv exit with error when reading from SVN transaction and not from SVN revision?

%% Problem solved and the code below acts as expected %%
I'm trying to write an SVN pre-commit hook in Bash that tests incoming files for UTF-8 encoding. After a lot of string juggling to get the paths of the incoming files and to ignore dirs/pictures/deleted files and so on, I use 'svnlook cat' to read each incoming file and pipe it to 'iconv -f UTF-8'. After this I read the exit status of the iconv operation with ${PIPESTATUS[1]}.
My code looks like this:
REPOS="$1"
TXN="$2"
SVNLOOK=/usr/bin/svnlook
ICONV=/usr/bin/iconv
# The file endings to ignore when checking for UTF-8:
IGNORED_ENDINGS=( png jar )
# Preparing to set the IFS (Internal Field Separator) so "for CHANGE in ..." will iterate
# over lines instead of words
OIFS="${IFS}"
NIFS=$'\n'
# Make sure that all files to be committed are encoded in UTF-8
IFS="${NIFS}"
for CHANGE in $($SVNLOOK changed -t "$TXN" "$REPOS"); do
IFS="${OIFS}"
# Skip change if first character is "D" (we don't care about checking deleted files)
if [ "${CHANGE:0:1}" == "D" ]; then
continue
fi
# Skip change if it is a directory (directories don't have encoding)
if [ "${CHANGE:(-1)}" == "/" ]; then
continue
fi
# Extract file repository path (remove first 4 characters)
FILEPATH=${CHANGE:4:(${#CHANGE}-4)}
# Ignore files that starts with "." like ".classpath"
IFS="//" # Change seperator to "/" so we can find the file in the file path
for SPLIT in $FILEPATH
do
FILE=$SPLIT
done
if [ "${FILE:0:1}" == "." ]; then
continue
fi
IFS="${OIFS}" # Reset Internal Field Seperator
# Ignore files that are not supposed to be checked, like images. (list defined in IGNORED_ENDINGS field above)
IFS="." # Change seperator to "." so we can find the file ending
for SPLIT in $FILE
do
ENDING=$SPLIT
done
IFS="${OIFS}" # Reset Internal Field Seperator
IGNORE="0"
for IGNORED_ENDING in ${IGNORED_ENDINGS[@]}
do
if [ `echo $IGNORED_ENDING | tr '[:upper:]' '[:lower:]'` == `echo $ENDING | tr '[:upper:]' '[:lower:]'` ] # case-insensitive compare of strings
then
IGNORE="1"
fi
done
if [ "$IGNORE" == "1" ]; then
continue
fi
# Read changed file and pipe it to iconv to parse it as UTF-8
$SVNLOOK cat -t "$TXN" "$REPOS" "$FILEPATH" | $ICONV -f UTF-8 -t UTF-16 -o /dev/null
# If iconv exited with a non-zero value (error) then return error text and reject commit
if [ "${PIPESTATUS[1]}" != "0" ]; then
echo "Only UTF-8 files can be committed (violated in $FILEPATH)" 1>&2
exit 1
fi
IFS="${NIFS}"
done
IFS="${OIFS}"
# All checks passed, so allow the commit.
exit 0
The problem is, every time I try to commit a file with Scandinavian characters like "æøå", iconv returns an error (exit 1).
If I disable the script, commit the file with "æøå", change the -t (transaction) in "svnlook changed -t" and "svnlook cat -t" to -r (revision), and run the script manually with the revision number of the "æøå" file, then iconv (and therefore the script) returns exit 0. And everything is dandy.
Why does svnlook cat -r work as expected (returning UTF-8 encoded "æøå" string), but not svnlook cat -t?
The problem was that iconv, when no output encoding is selected, falls back to the character set of the current locale, and SVN hooks run with an almost empty environment (typically the C locale), so the implicit target was effectively ASCII and the non-ASCII characters made the conversion fail.
Changing
$SVNLOOK cat -t "$TXN" "$REPOS" "$FILEPATH" | $ICONV -f UTF-8 -o /dev/null
to
$SVNLOOK cat -t "$TXN" "$REPOS" "$FILEPATH" | $ICONV -f UTF-8 -t UTF-16 -o /dev/null
solved the problem, and made the script behave as expected :)
