Why does iconv exit with error when reading from SVN transaction and not from SVN revision? - linux

%% Problem solved and the code below acts as expected %%
I'm trying to write a SVN pre-commit hook in Bash that tests incoming files for UTF-8 encoding. After a lot of string juggling to get the path of the incoming files and ignore dirs/pictures/deleted files and so on, I use 'svnlook cat' to read the incoming file and pipe it to 'iconv -f UTF-8'. After this I read the exit status of the iconv operation with ${PIPESTATUS[1]}.
My code looks like this:
REPOS="$1"
TXN="$2"
SVNLOOK=/usr/bin/svnlook
ICONV=/usr/bin/iconv
# The file endings to ignore when checking for UTF-8:
IGNORED_ENDINGS=( png jar )
# Prepairing to set the IFS (Internal Field Separator) so "for CHANGE in ..." will iterate
# over lines instead of words
OIFS="${IFS}"
NIFS=$'\n'
# Make sure that all files to be committed are encoded in UTF-8
IFS="${NIFS}"
for CHANGE in $($SVNLOOK changed -t "$TXN" "$REPOS"); do
IFS="${OIFS}"
# Skip change if first character is "D" (we dont care about checking deleted files)
if [ "${CHANGE:0:1}" == "D" ]; then
continue
fi
# Skip change if it is a directory (directories don't have encoding)
if [ "${CHANGE:(-1)}" == "/" ]; then
continue
fi
# Extract file repository path (remove first 4 characters)
FILEPATH=${CHANGE:4:(${#CHANGE}-4)}
# Ignore files that starts with "." like ".classpath"
IFS="//" # Change seperator to "/" so we can find the file in the file path
for SPLIT in $FILEPATH
do
FILE=$SPLIT
done
if [ "${FILE:0:1}" == "." ]; then
continue
fi
IFS="${OIFS}" # Reset Internal Field Seperator
# Ignore files that are not supposed to be checked, like images. (list defined in IGNORED_ENDINGS field above)
IFS="." # Change seperator to "." so we can find the file ending
for SPLIT in $FILE
do
ENDING=$SPLIT
done
IFS="${OIFS}" # Reset Internal Field Seperator
IGNORE="0"
for IGNORED_ENDING in ${IGNORED_ENDINGS[#]}
do
if [ `echo $IGNORED_ENDING | tr [:upper:] [:lower:]` == `echo $ENDING | tr [:upper:] [:lower:]` ] # case insensitive compare of strings
then
IGNORE="1"
fi
done
if [ "$IGNORE" == "1" ]; then
continue
fi
# Read changed file and pipe it to iconv to parse it as UTF-8
$SVNLOOK cat -t "$TXN" "$REPOS" "$FILEPATH" | $ICONV -f UTF-8 -t UTF-16 -o /dev/null
# If iconv exited with a non-zero value (error) then return error text and reject commit
if [ "${PIPESTATUS[1]}" != "0" ]; then
echo "Only UTF-8 files can be committed (violated in $FILEPATH)" 1>&2
exit 1
fi
IFS="${NIFS}"
done
IFS="${OIFS}"
# All checks passed, so allow the commit.
exit 0
The problem is, every time I try to commit a file with Scandinavian characters like "æøå", iconv returns an error (exit 1).
If I disable the script, commit the file with "æøå", change the -t (transaction) in "svnlook -t" and "svnlook cat -t" to a -r (revision), and run the script manually with the revision number of the "æøå" file, then iconv (and therefore the script) returns exit 0. And everything is dandy.
Why does svnlook cat -r work as expected (returning UTF-8 encoded "æøå" string), but not svnlook cat -t?

The problem was that iconv apparently behaves unexpectedly if no output encoding is selected.
Changing
$SVNLOOK cat -t "$TXN" "$REPOS" "$FILEPATH" | $ICONV -f UTF-8 -o /dev/null
to
$SVNLOOK cat -t "$TXN" "$REPOS" "$FILEPATH" | $ICONV -f UTF-8 -t UTF-16 -o /dev/null
solved the problem, and made the script behave as expected :)
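For reference, here is a small sketch (not from the original post) that reproduces the check outside the hook. "æøå" is the byte sequence C3 A6 C3 B8 C3 A5 in UTF-8 and E6 F8 E5 in ISO-8859-1:
# Valid UTF-8 input: iconv exits 0
printf '\xc3\xa6\xc3\xb8\xc3\xa5\n' | iconv -f UTF-8 -t UTF-16 -o /dev/null
echo "UTF-8 input -> exit ${PIPESTATUS[1]}"
# The same characters as ISO-8859-1 bytes are not valid UTF-8: iconv exits non-zero
printf '\xe6\xf8\xe5\n' | iconv -f UTF-8 -t UTF-16 -o /dev/null
echo "ISO-8859-1 input -> exit ${PIPESTATUS[1]}"
Note that ${PIPESTATUS[1]} is expanded before each echo runs, so it still refers to the iconv in the preceding pipeline.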

Related

Multiple process curl command for urls to output to individual files

I am attempting to curl multiple URLs in a bash command. Eventually I will be curling a large number of URLs, so I am using xargs to run multiple processes and speed things up.
My file consists of x number of URLs:
https://someurl.com
https://someotherurl.com
My issue comes when attempting to output the results to separate files named after the URLs I curl.
The bash command I have is:
xargs -P 5 -n 1 -I% curl -k -L % -0 % < urls.txt
When I run this I get 'Failed to create file https://someotherurl.com'
You cannot create a file with / in the filename. You could do it this way:
#!/bin/bash
while IFS= read -r line
do
echo "LINE: $line"
if [[ "$line" != "" ]]
then
filename="${line#https://}"
echo "FILENAME: $filename"
# YOUR CURL COMMAND HERE, USING $filename
fi
done < url.txt
it ignores empty lines
variable substitution is used to remove the https:// part of each URL
this will allow you to create the file
Note: if your URLs contain sub-directories, they must be removed as well.
Ex: you want to do https://www.exemple.com/some/sub/dir
The script I suggested here would try to create a file named "www.exemple.com/some/sub/dir". In this case, you could replace the / with _ using tr.
The script would become:
#!/bin/bash
while IFS= read -r line
do
echo "LINE: $line"
if [[ "$line" != "" ]]
then
filename=$(echo "$line" | tr '/' '_')
filename2=${filename#https:__}
echo "FILENAME: $filename2"
# YOUR CURL COMMAND HERE, USING $filename2
fi
done < url.txt
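If you also want to keep the parallelism of the original xargs -P command, here is a minimal sketch (my own addition, not from the answer above) that combines xargs with the same idea of sanitising the URL into a filename, replacing / and : with _:
xargs -P 5 -n 1 sh -c 'url=$1; out=$(printf %s "$url" | tr "/:" "__"); curl -ksSL -o "$out" "$url"' sh < urls.txt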
Because your question is ambiguous, I would assume:
You have a file urls.txt that contains URLs separated by LF.
You want to download all URLs by curl and use each URL as its filename.
Unfortunately, that's not possible, because a URL contains characters like the slash / that are invalid in filenames. Alternatively, for this case, I would suggest you use Base64 URL-safe mode (RFC 3548) to encode each URL before using it as a filename.
After applying this requirement, your script would become:
seq 100 | xargs -I# echo 'https://example.com?#' > urls.txt
xargs -P0 -L1 sh -c 'curl -SskL0 -o $(printf %s "$1" | uuencode -m /dev/stdout | sed "1d;\$d" | tr +/ -_) "$1"' sh < urls.txt

how to filter out / ignore specific lines when comparing text files with diff

To further clarify what I am trying to do, I wrote the script below. I am attempting to audit some files between my QA and PRD environments and would like the final diff output to ignore hard-coded values such as SQL connections. I have about six different values to filter. I have tried several ways, but so far I have not been able to get any of them to work as needed. I am open to doing this another way if anyone has any ideas. I am pretty new to script development, so I'm open to any ideas or information. Thanks :)
#!/bin/bash
#*********************************************************************
#
# Name: compareMD5.sh
# Date: 02/12/2018
# Script Location:
# Author: Maggie o
#
# Description: This script will pull absolute paths from a text file
# and compare the files via ssh between QA & PRD on md5sum
# output match or no match
# Then the file the non matching files will be imported to a
# tmp directory via scp
# Files will be compared locally and exclude whitespace,
# spaces, comments, and hard coded values
# NOTE: Script may take several minutes to run
#
# Usage: Auditing QA to PRD Pass 3
# nohup ./compareMD52.sh > /output/compareMD52.out 2> /error/compareMD52.err
# checking run ps -ef | grep compareMD52*
#**********************************************************************
rm /output/no_matchMD5.txt
rm /output/filesDiffer.txt
echo "Filename | Path" > /output/matchingMD5.txt
#Remove everything below tmp directory recursively as it was created by previous script run
rm -rf /tmp/*
for i in $(cat /input/comp_list.txt) #list of files with absolute paths output by compare script
do
export filename=$(basename "$i") #Grab just the filename
export path=$(dirname "$i") #Just the Directory
qa_md5sum=$(md5sum "$i") #Get the md5sum
qa_md5="${qa_md5sum%% *}" #remove the appended path
export tmpdir=(/tmp"$path")
# if the stat is not null then run the if, i.e. if the file exists
if ssh oracle@Someconnection stat $path'$filename' \> /dev/null 2\>\&1
then
prd_md5sum=$(ssh oracle@Somelocation "cd $path; find -name '$filename' -exec md5sum {} \;")
prd_md5="${prd_md5sum%% *}" #remove the appended path
if [[ $qa_md5 == $prd_md5 ]] #Match hash as integer
then
echo $filename $path " QA Matches PRD">> /output/matchingMD5.txt
else
echo $i
echo $tmpdir
echo "Copying "$i" to "$tmpdir >> /output/no_matchMD5.txt
#Copy the file from PRD to a tmp Dir in QA, keep dir structure to avoid issues of the same filename existing in different directories
mkdir -p $tmpdir # -p creates only if not existing, does not produce errors if existing
scp oracle@Somelocation:$i $tmpdir # get the file from Prd, Insert into tmp Directory
fi
fi
done
for x in $(cat /output/no_matchMD5.txt) #do a local compare using diff
do
comp_filename=$(basename "$x")
#Ignore Comments, no white space, no blank lines, and only report if different but not How different
qa=(/tmp"$x")
#IN TEST
if diff -bBq -I '^#' $x $qa >/dev/null
# Fails to catch files if the Comment then the start of a line
then
echo $comp_filename " differs more than just white space, or comment"
echo $x >> /output/filesDiffer.txt
fi
done
You can pipe the output into grep -v
Like this:
diff -bBq TEST.sh TEST2.sh | grep -v "^#"
I was able to get this figured out using this method:
if diff -bBqZ -I '^#' <(grep -vE '(thing1|thing2|thing3)' $x) <(grep -vE '(thing1|thing2|thing3)' $prdfile)
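To see that pattern working in isolation, here is a small self-contained sketch (the file names and the host= line are made up for the example):
# Two hypothetical config files that differ only in a hard-coded value
printf 'host=qa-db\nport=1521\nname=app\n' > qa.cfg
printf 'host=prd-db\nport=1521\nname=app\n' > prd.cfg
# Strip the hard-coded lines on both sides with process substitution, then diff what is left;
# diff exits 0 when the remaining content is identical
if diff -bB -I '^#' <(grep -vE '^host=' qa.cfg) <(grep -vE '^host=' prd.cfg) >/dev/null
then
    echo "files match once the ignored values are filtered out"
else
    echo "files differ beyond the ignored values"
fi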

grep from tar.gz without extracting [faster one]

I am trying to grep a pattern from a dozen .tar.gz files, but it is very slow.
I am using:
tar -ztf file.tar.gz | while read FILENAME
do
if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
then
echo "$FILENAME contains string"
fi
done
If you have zgrep you can use
zgrep -a string file.tar.gz
You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual.
Armed with the above information, you could try something like:
$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc
I know this question is 4 years old, but I have a couple of different options:
Option 1: Using tar --to-command grep
The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.
tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'
Option 2: Using tar -tzf
The second option is using tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:
targrep () {
for i in $(tar -tzf "$1"); do
results=$(tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2")
echo "$results"
done
}
Usage:
targrep example.tar.gz "pattern"
Both the below options work well.
$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.
The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.
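As a rough sketch of that middle ground (decompress once, keep the per-file loop; the pattern and file names are placeholders, and gunzip -k needs gzip 1.6 or newer, otherwise use gunzip < file.tar.gz > file.tar):
gunzip -k file.tar.gz    # produces file.tar and keeps the original archive
tar -tf file.tar | while read -r FILENAME; do
    if tar -xf file.tar "$FILENAME" -O | grep -q "string"; then
        echo "$FILENAME contains string"
    fi
done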
If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:
tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
grep -l "string" $f
done
Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
For starters, you could start more than one process:
tar -ztf file.tar.gz | while read FILENAME
do
(if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
then
echo "$FILENAME contains string"
fi) &
done
The ( ... ) & creates a new detached (read: the parent shell does not wait for the child)
process.
After that, you should optimize the extracting of your archive. The read is no problem,
as the OS should have cached the file access already. However, tar needs to unpack
the archive every time the loop runs, which can be slow. Unpacking the archive once
and iterating over the result may help here:
tempPath=$(mktemp -d)
tar -zxf file.tar.gz -C "$tempPath" &&
find $tempPath -type f | while read FILENAME
do
(if grep -l "string" "$FILENAME"
then
echo "$FILENAME contains string"
fi) &
done && rm -r $tempPath
find is used here, to get a list of files in the target directory of tar, which we're iterating over, for each file searching for a string.
Edit: Use grep -l to speed up things, as Jim pointed out. From man grep:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would
normally have been printed. The scanning will stop on the first match. (-l is specified
by POSIX.)
That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
Which requires just one command to search file.tar.gz as follows:
ugrep -z "string" file.tar.gz
This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt
All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory to find a pattern that is specified as an argument in a reusable script to output:
The name of both the archive file and the extracted file
The line number where the pattern was found
The contents of the matching line
It's what I was really hoping that zgrep could do for me and it just can't.
Here's my solution:
pattern=$1
for f in *.tar.gz; do
echo "$f:"
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done
You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:
tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""
Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question is obvious.
tar -xzf: x extract, z filter through gzip, f based on the following archive file...
"$f": The archive file provided by the for loop (such as what you'd get by doing an ls) in double-quotes to allow the variable to expand and ensure that the script is not broken by any file names with spaces, etc.
--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.
Let's break that part down by itself, since it's the "secret sauce" here.
'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
First, we use a single-quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.
grep: The command to be run on the (not actually) extracted files
--label=: The label to prepend the results, the value of which is enclosed in double-quotes because we do want to have the grep command resolve the $TAR_FILENAME environment variable passed in by the tar command.
basename $TAR_FILENAME: Runs as a command (surrounded by backticks) and removes directory path and outputs only the name of the file
-Hin: H Display filename (provided by the label), i Case insensitive search, n Display line number of match
Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that the $pattern, passed in as the first argument, can be resolved.
Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)
And it's been a couple of weeks since I wrote the above and it's still super useful... but it wasn't quite good enough, as files have piled up and searching for things has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.
if [ -z "$1" ]; then
echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
echo "Usage: targrep <string to search for> [start date]"
fi
pattern=$1
startdatein=$2
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
filedate=$(date -r "$f" +%s)
if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
echo "$f:"
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
fi
done
And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.
Usage:
targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>
Example:
targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford
while getopts "d:f:" opt; do
case $opt in
d) startdatein=$OPTARG;;
f) targetfile=$OPTARG;;
esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1
echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
echo "in filenames: $targetfile"
fi
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
filedate=$(date -r "$f" +%s)
if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
echo "$f:"
if [[ -z "$targetfile" ]]; then
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
else
tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
fi
fi
done
zgrep works fine for me, but only if all the files inside are plain text.
It seems nothing works if the tgz file contains gzip files.
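A rough workaround sketch for that nested case (my own assumption, not from the comment above: it relies on GNU tar running the --to-command string through a shell, and on the inner members having a .gz suffix):
export PATTERN="string"
tar -xzf file.tgz --to-command '
    case "$TAR_FILENAME" in
        *.gz) gzip -dc | grep -H --label="$TAR_FILENAME (gunzipped)" "$PATTERN" ;;
        *)    grep -H --label="$TAR_FILENAME" "$PATTERN" ;;
    esac
    true'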
You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:
pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/
This is much faster than iterating over each file and piping it to grep separately, especially for compressed TARs. Here are benchmark results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:
Compression | Ratarmount   | Bash Loop over tar -O
none        | 0.31 +- 0.01 | 0.55 +- 0.02
gzip        | 1.1 +- 0.1   | 13.5 +- 0.1
bzip2       | 1.2 +- 0.1   | 97.8 +- 0.2
Of course, these results are highly dependent on the archive size and how many files the archive contains. These test examples are pretty small because I didn't want to wait too long. But, they already exemplify the problem well enough. The more files there are, the longer it takes for tar -O to jump to the correct file. And for compressed archives, it will be quadratically slower the larger the archive size is because everything before the requested file has to be decompressed and each file is requested separately. Both of these problems are solved by ratarmount.
This is the code for benchmarking:
function checkFilesWithRatarmount()
{
local pattern=$1
local archive=$2
ratarmount "$archive" "$archive.mountpoint"
'grep' -r -l "$pattern" "$archive.mountpoint/"
}
function checkEachFileViaStdOut()
{
local pattern=$1
local archive=$2
tar --list --file "$archive" | while read -r file; do
if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
echo "Found pattern in: $file"
fi
done
}
function createSampleTar()
{
for i in $( seq 40 ); do
head -c $(( 1024 * 1024 )) /dev/urandom | base64 > $i.dat
done
tar -czf "$1" [0-9]*.dat
}
createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint
In my case the tarballs have a lot of tiny files and I want to know what archived file inside the tarball matches. zgrep is fast (less than one second) but doesn't provide the info I want, and tar --to-command grep is much, much slower (many minutes) [1].
So I went the other direction and had zgrep tell me the byte offsets of the matches in the tarball and put that together with the list of offsets in the tarball of all archived files to find the matching archived files.
#!/bin/bash
set -e
set -o pipefail
function tar_offsets() {
# Get the byte offsets of all the files in a given tarball
# based on https://stackoverflow.com/a/49865044/60422
[ $# -eq 1 ]
tar -tvf "$1" -R | awk '
BEGIN{
getline;
f=$8;
s=$5;
}
{
offset = int($2) * 512 - and((s+511), compl(512)+1)
print offset,s,f;
f=$8;
s=$5;
}'
}
function tar_byte_offsets_to_files() {
[ $# -eq 1 ]
# Convert the search results of a tarball with byte offsets
# to search results with archived file name and offset, using
# the provided tar_offsets output (single pass, suitable for
# process substitution)
offsets_file="$1"
prev_offset=0
prev_offset_filename=""
IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
while IFS=':' read -r search_result_offset match_text
do
while [ $last_offset -lt $search_result_offset ]; do
prev_offset=$last_offset
prev_offset_filename="$last_offset_filename"
IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
# offsets increasing safeguard
[ $prev_offset -le $last_offset ]
done
# now last offset is the first file strictly after search result offset so prev offset is
# the one at or before it, and must be the one it is in
result_file_offset=$(( $search_result_offset - $prev_offset ))
echo "$prev_offset_filename:$result_file_offset:$match_text"
done
}
# Putting it together e.g.
zgrep -a --byte-offset "your search here" some.tgz | tar_byte_offsets_to_files <(tar_offsets some.tgz)
[1] I'm running this in Git for Windows' minimal MSYS2 fork unixy environment, so it's possible that the launch overhead of grep is much, much higher than on any kind of real Unix machine and would make tar --to-command grep good enough there; benchmark solutions for your own needs and platform situation before selecting.

How to convert ISO8859-15 to UTF8?

I have an Arabic file encoded in ISO8859-15. How can I convert it into UTF8?
I used iconv but it doesn't work for me.
iconv -f ISO-8859-15 -t UTF-8 Myfile.txt
I wanted to attach the file, but I don't know how.
Could it be that your file is not ISO-8859-15 encoded? You should be able to check with the file command: file YourFile.txt
Also, you can use iconv without providing the encoding of the original file: iconv -t UTF-8 YourFile.txt
I found this to work for me:
iconv -f ISO-8859-14 Agreement.txt -t UTF-8 -o agreement.txt
I have Ubuntu 14 and the other answers were not working for me.
iconv -f ISO-8859-1 -t UTF-8 in.tex > out.tex
I found this command here
We had this problem too. To solve it:
Create a script file called to-utf8.sh
#!/bin/bash
TO="UTF-8"; FILE=$1
FROM=$(file -i $FILE | cut -d'=' -f2)
if [[ $FROM = "binary" ]]; then
echo "Skipping binary $FILE..."
exit 0
fi
iconv -f $FROM -t $TO -o $FILE.tmp $FILE; ERROR=$?
if [[ $ERROR -eq 0 ]]; then
echo "Converting $FILE..."
mv -f $FILE.tmp $FILE
else
echo "Error on $FILE"
fi
Set the executable bit
chmod +x to-utf8.sh
Do a conversion
./to-utf8.sh MyFile.txt
If you want to convert all files under a folder, do
find /your/folder/here | xargs -n 1 ./to-utf8.sh
Hope it helps.
I had the same problem, but I found the answer on this page! It works for me; you can try it.
iconv -f cp936 -t utf-8 
In my case, the file command reported a wrong encoding, so I tried converting with all the possible encodings and found the right one.
Execute this script and check the result file:
for i in `iconv -l`
do
echo $i
iconv -f $i -t UTF-8 yourfile | grep "hint to tell converted success or not"
done &>/tmp/converted
You can use ISO-8859-9 encoding:
iconv -f ISO-8859-9 Agreement.txt -t UTF-8 -o agreement.txt
Iconv just writes the converted text to stdout. You have to use -o OUTPUTFILE.txt as a parameter or redirect stdout to a file (iconv -f x -t z filename.txt > OUTPUTFILE.txt, or iconv -f x -t z < filename.txt > OUTPUTFILE.txt in some iconv versions).
Synopsis
iconv -f encoding -t encoding inputfile
Description
The iconv program converts the encoding of characters in inputfile from one coded character set to another.
The result is written to standard output unless otherwise specified by the --output option.
--from-code, -f encoding
Convert characters from encoding
--to-code, -t encoding
Convert characters to encoding
--list
List known coded character sets
--output, -o file
Specify output file (instead of stdout)
--verbose
Print progress information.
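A short end-to-end sketch of that (the file names are placeholders):
iconv -f ISO-8859-15 -t UTF-8 -o Myfile.utf8.txt Myfile.txt
file -bi Myfile.utf8.txt    # should now report charset=utf-8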

iconv any encoding to UTF-8

I am trying to point iconv at a directory so that all files will be converted to UTF-8 regardless of their current encoding.
I am using this script, but you have to specify what encoding you are converting FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary
if [ $# -lt 3 ]
then
echo "$0 dir from_charset to_charset"
exit
fi
for f in $1/*
do
if test -f $f
then
echo -e "\nConverting $f"
/bin/mv $f $f.old
$ICONVBIN -f $2 -t $3 $f.old > $f
else
echo -e "\nSkipping $f - not a regular file";
fi
done
terminal line
sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently on language.
Note that in general, autodetection of current encoding is a difficult process (the same byte sequence can be correct text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of encodings). You can use enconv to convert text files to a single encoding.
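A minimal sketch of that last step (hedged: it assumes enca's documented -L language and -x target-encoding options; 'czech' is only an example language):
# Detect the encoding, telling enca which language to assume
enca -L czech suspicious.txt
# Convert the file to UTF-8 in place with the built-in converter
enconv -L czech -x UTF-8 suspicious.txt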
You can get what you need using the standard GNU utils file and awk. Example:
file -bi .xsession-errors
gives me:
"text/plain; charset=us-ascii"
so file -bi .xsession-errors |awk -F "=" '{print $2}'
gives me
"us-ascii"
I use it in scripts like so:
CHARSET="$(file -bi "$i"|awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
iconv -f "$CHARSET" -t utf8 "$i" -o outfile
fi
Compiling all of them: go to the dir and create dir2utf8.sh:
#!/bin/bash
# converting all files in a dir to utf8
for f in *
do
if test -f $f; then
echo -e "\nConverting $f"
CHARSET="$(file -bi "$f"|awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
iconv -f "$CHARSET" -t utf8 "$f" -o "$f"
fi
else
echo -e "\nSkipping $f - it's a regular file";
fi
done
Here is my solution to convert all files in place using recode and uchardet:
#!/bin/bash
apt-get -y install recode uchardet > /dev/null
find "$1" -type f | while read FFN # 'dir' should be changed...
do
encoding=$(uchardet "$FFN")
echo "$FFN: $encoding"
enc=`echo $encoding | sed 's#^x-mac-#mac#'`
set +x
recode $enc..UTF-8 "$FFN"
done
put it into convert-dir-to-utf8.sh and run:
bash convert-dir-to-utf8.sh /pat/to/my/trash/dir
Note that sed is a workaround for mac encodings here.
Many uncommon encodings need workarounds like this.
First answer
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d $'\0' LINE_FILE; do
CHARSET=$(uchardet $LINE_FILE)
echo "Converting ($CHARSET) $LINE_FILE"
# NOTE: Convert/reconvert to utf8. By Questor
iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"
# NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
# [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
# https://stackoverflow.com/a/45240995/3223785 ]
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files are not correctly converted (characters will be lost) or are "truncated". I suspect that this has to do with the "iconv" tool or with the charset information obtained with the "uchardet" tool. I was curious about the solution presented by @demofly because it could be safer.
Another answer
Based on @demofly's answer:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d $'\0' LINE_FILE; do
CHARSET=$(uchardet $LINE_FILE)
REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
echo "\"$CHARSET\" \"$LINE_FILE\""
# NOTE: Convert/reconvert to utf8. By Questor
recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
STDERR_OP=$(cat STDERR_OP)
rm -f STDERR_OP
if [ -n "$STDERR_OP" ] ; then
# NOTE: Convert/reconvert to utf8. By Questor
iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
STDERR_OP=$(cat STDERR_OP)
rm -f STDERR_OP
fi
# NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
# [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
# https://stackoverflow.com/a/45240995/3223785 ]
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
if [ -n "$STDERR_OP" ] ; then
echo "ERROR: \"$STDERR_OP\""
fi
STDOUT_OP=$(cat STDOUT_OP)
rm -f STDOUT_OP
if [ -n "$STDOUT_OP" ] ; then
echo "RESULT: \"$STDOUT_OP\""
fi
done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
Third answer
Hybrid solution with recode and vim:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d $'\0' LINE_FILE; do
CHARSET=$(uchardet $LINE_FILE)
REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
echo "\"$CHARSET\" \"$LINE_FILE\""
# NOTE: Convert/reconvert to utf8. By Questor
recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
STDERR_OP=$(cat STDERR_OP)
rm -f STDERR_OP
if [ -n "$STDERR_OP" ] ; then
# NOTE: Convert/reconvert to utf8. By Questor
bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
else
# NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
# [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
# https://stackoverflow.com/a/45240995/3223785 ]
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
fi
done
This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.
WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems probably will appear!
TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can be executed after a preliminary comparison with the merge tool after a conversion without it since it can cause "differences".
NOTE: The search using find brings all non-binary files from the given path ("") and its subfolders.
Check out the tools available for data conversion in a Linux CLI: https://www.debian.org/doc/manuals/debian-reference/ch11.en.html
Also, to get the full list of encodings available in iconv, just run iconv --list. Note that the encoding names differ from the names returned by the uchardet tool (for example: x-mac-cyrillic in uchardet vs. mac-cyrillic in iconv).
The enca command doesn't work for my Simplified Chinese text file with GB2312 encoding.
Instead, I use the following function to convert the text file for me.
You could of course re-direct the output into a file.
It requires chardet and iconv commands.
detection_cat ()
{
DET_OUT=$(chardet $1);
ENC=$(echo $DET_OUT | sed "s|^.*: \(.*\) (confid.*$|\1|");
iconv -f $ENC $1
}
use iconv and uchardet (thx farseerfc)
fish shell
cat your_file | iconv -f (uchardet your_file ) -t UTF-8
bash shell
cat your_file | iconv -f $(uchardet your_file ) -t UTF-8
If you use a bash script:
#!/usr/bin/bash
for fn in "$#"
do
iconv < "$fn" -f $(uchardet "$fn") -t utf8
done
by @flowinglight at the Ubuntu group.
