Change all non-ascii chars to ascii Bash Scripting - linux

I am trying to write a script that takes people's names as arguments and creates a folder for each name. Non-ASCII characters and whitespace in folder names can sometimes cause problems, so I want to remove them or change them to ASCII characters.
I can remove the whitespace between the name and the surname, but I cannot figure out how to change ş->s, ç->c, ğ->g, ı->i, ö->o.
Here is my code:
#!/bin/bash
ARRAY=("$@")
ELEMENTS=${#ARRAY[@]}
for (( i=0; i<ELEMENTS; i++ ))
do # C-like for loop syntax
echo "${ARRAY[$i]}" | grep "[^ ]*\b" | tr -d ' '
done
I run my script like this: myscript.sh 'Çişil Aksoy' 'Cem Dalgıç'
It should change the arguments to: CisilAksoy CemDalgic
Thanks in advance.
EDIT :
I found this solution. It does not look very pretty, but it works.
sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;'
EDIT2 : SOLVED
#!/bin/bash
ARRAY=("$@")
ELEMENTS=${#ARRAY[@]}
for (( i=0; i<ELEMENTS; i++ ))
do # C-like for loop syntax
v=$(echo "${ARRAY[$i]}" | grep "[^ ]*\b" | tr -d ' ' | sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;')
mkdir -- "$v" # quoted so an unexpected character cannot split the name
done

Anything that converts from UTF-8 to ASCII is going to be a compromise.
The iconv program does what was requested (though not necessarily to everyone's satisfaction, as discussed in "Transliterate any convertible utf8 char into ascii equivalent"). Given
Çişil Aksoy' 'Cem Dalgıç
in "foo.txt", and the command
iconv -f UTF8 -t ASCII//TRANSLIT <foo.txt
that would give
Cisil Aksoy' 'Cem Dalg?c
The lynx browser has a different set of ASCII approximations. Using this command
lynx -display_charset=us-ascii -force_html -nolist -dump foo.txt
I get this result:
C,isil Aksoy' 'Cem Dalgic,

Simply put, you can't. ASCII only supports 128 characters.
International characters typically use some variation of Unicode, which can store a far greater number of characters.
I think your best bet is to identify WHY your folder creation fails when using these characters. Does the method or function not support Unicode? If it does, figure out how to specify that instead of ASCII. If not, you might be stuck with sed and/or tr, which is probably not sustainable.
[UPDATED]
You should be able to substitute multiple characters with tr as follows:
echo şğıö | tr şçğıö scgio
sgio
(I removed my comment from earlier. I tried it on a different server and it worked fine.)
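Whether tr handles multibyte characters depends on the implementation; GNU tr, as far as I know, works on bytes, which may explain why the command behaved differently on two servers. A minimal sketch that avoids the issue by spelling out every substitution in sed (the function names to_ascii_name and make_name_dirs are hypothetical, and the mapping only covers the Turkish letters from the question):

```shell
#!/bin/bash
# Hypothetical helper: transliterate Turkish letters to ASCII and drop whitespace.
# Upper- and lower-case letters are listed explicitly, so no case-insensitive
# flag (and no particular locale) is needed.
to_ascii_name() {
    printf '%s\n' "$1" |
        sed 's/ş/s/g; s/Ş/S/g; s/ç/c/g; s/Ç/C/g; s/ğ/g/g; s/Ğ/G/g;
             s/ı/i/g; s/İ/I/g; s/ö/o/g; s/Ö/O/g; s/ü/u/g; s/Ü/U/g' |
        tr -d ' \t\n'
}

# Hypothetical helper: create one directory per name passed as an argument.
make_name_dirs() {
    local name
    for name in "$@"; do
        mkdir -p -- "$(to_ascii_name "$name")"
    done
}
```

Calling make_name_dirs "$@" at the end of the script would then create CisilAksoy and CemDalgic for the two example arguments.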

Related

sed doesn't remove characters from UTF range properly

I want to strip from my file all characters except Russian and Arabic letters, "|" and the space character. Let's start with only Arabic letters. So I have:
cat file.tzt | sed 's/[^\u0600-\u06FF]//g'
sed: -e expression #1, char 21: Invalid range end.
I have tried [\u0621-\u064A], with the same result.
I also tried to use {Arabic}, but it doesn't clean the files properly at all.
The error looks rather strange to me. Obviously, 064FF > 0621.
So, overall, I want to have something like this:
cat file.tzt | sed 's/[^\u0600-\u06FFа-яА-Я |]//g'
I am fine with awk or any other utility, but as far as I know sed is stable and reliable.
Perl understands UTF-8:
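perl -CSD -pe 's/[^\N{U+0600}-\N{U+06FF}\n]//g' -- file.txt
-C turns on UTF-8 support: S applies it to stdin/stdout/stderr, and D makes UTF-8 the default layer for other input and output streams. The \n inside the bracket expression keeps the line breaks, which the negated class would otherwise delete as well.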
You can also use Unicode properties:
s/\P{Cyrillic}//g

Remove lines with japanese characters from a file

First question on here- I've searched around to put together an answer to this but have come up empty thus far.
I have a multi-line text file that I am cleaning up. Part of this is to remove lines that include Japanese characters. I have been using sed for my other operations but it is not working in this instance.
I was under the impression that using the -r switch and the \p{Han} regular expression would work (from looking at other questions of this kind), but it is not working in this case.
Here is my test string - running this returns the full string, and does not filter out the JP characters as I was expecting.
echo 80岁返老还童的处女: 第3话 | sed -r "s/\\p\{Han\}//g"
Am I missing something? Is there another command I should be using instead?
I think this might work for you:
echo "80岁返老还童的处女: 第3话" | tr -cd '[:print:]\n'
As far as I know, sed supports neither Unicode classes nor multibyte ranges.
-d deletes the characters in SET1, and -c complements the set, so everything not in SET1 is deleted.
[:print:] matches all printable characters, including the space.
\n is a newline.
Note that this removes not only the Japanese characters but all multibyte characters, as well as control characters.
Perl can also be used:
PERLIO=:utf8 perl -pe 's/\p{Han}//g' file
PERLIO=:utf8 tells Perl to treat input and output as UTF-8.

Bash separation of line with newlines instead of spaces

I ran the following two commands:
With the first one I listed the contents of a directory and stored the listing in a variable.
The second one shows the contents of the variable.
Now I decided that I want to separate the listing with newlines instead of spaces, so I do the following:
I get a mess. Why?
It's worth noting that when I changed the command as follows, it worked as I wanted:
Could someone please explain why 0x20 or 32 (I tried this number too) is not treated as a space in this case?
tr simply doesn't recognize hex escapes, only octal ones. This would work:
tr '\040' '\n'
And the easier way to show your files is
shopt -s nullglob ## Optional.
printf '%s\n' *
The problem with tr '\0x20' is that tr treats the whole sequence as literal characters, namely 0, x and 2. Note that all of these characters were replaced by \n in the output. That's why you have .t instead of txt, and why 2 didn't appear either.
This is not Bash; it is tr that is making you unhappy. If you really want to iterate over file names, there are better ways to do that.
for f in *; do
# do work with $f. But always use quotes. Like `"$f"`
done
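As a small illustration of the octal form, here is a sketch with a hypothetical function name, spaces_to_newlines; the file names in the demo line are made up:

```shell
#!/bin/bash
# Hypothetical helper: turn every space on stdin into a newline.
# '\040' is the octal spelling of 0x20; tr ' ' '\n' would be equivalent.
spaces_to_newlines() {
    tr '\040' '\n'
}

# Demo with made-up file names: prints each name on its own line.
printf '%s\n' 'a.txt b.txt c.txt' | spaces_to_newlines
```

This only splits on spaces, so it is still unsafe for file names that contain spaces themselves; the glob-based loop above does not have that problem.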

Replacing strings with special characters with linux sed

I've read lots of posts to understand how to correctly escape whitespace and special characters inside strings using sed, but I still can't make it work. Here's what I'm trying to achieve.
I have a file containing some strings like this one:
JAVA_OPTS="$JAVA_OPTS -Dorg.apache.catalina.jsessionid=some_value"
and I'm trying to replace 'some_value' using the following:
sed -i "s/^\(JAVA_OPTS=\"\$JAVA_OPTS[ \t]*-Dorg\.apache\.catalina\.jsessionid*=\s*\).*\$/\1$DORG_APACHE_CATALINA_JSESSIONID/" $JBOSS_CONFIGURATION/jboss.configuration
$JBOSS_CONFIGURATION is a variable containing an absolute Linux path.
jboss.configuration is the file I'm pointing to as the target for the replace operations.
$DORG_APACHE_CATALINA_JSESSIONID contains the value I want instead of 'some_value'.
Please note that the pattern:
JAVA_OPTS="$JAVA_OPTS -D
is always present, and org.apache.catalina.jsessionid is an example of a variable whose value I'm trying to replace with this script.
What's missing or wrong? I also tried escaping whitespace using \s, without success,
and echoing the whole expression gives me the following:
echo "s/^\(JAVA_OPTS=\"\$JAVA_OPTS[ \t]*-Dorg\.apache\.catalina\.jsessionid*=\s*\).*\$/\1$DORG_APACHE_CATALINA_JSESSIONID/"
s/^\(JAVA_OPTS="$JAVA_OPTS[ \t]*-Dorg\.apache\.catalina\.jsessionid*=\s*\).*$/\1/
Is echo interpreting the search pattern the way sed does?
Any info, help or alternative ways of doing it are highly welcome.
Thank you all.
echo 'JAVA_OPTS="$JAVA_OPTS -Dorg.apache.catalina.jsessionid=some_value"' | (export DORG_APACHE_CATALINA_JSESSIONID=FOO/BAR/FOOBAR; sed "s/^\(JAVA_OPTS=\"\$JAVA_OPTS[ \t]*-Dorg\.apache\.catalina\.jsessionid*=\s*\).*\$/\1${DORG_APACHE_CATALINA_JSESSIONID////\/}\"/")
Note the bash expansion (used to escape any / that may trip up sed) and the extra \" after $DORG_APACHE_CATALINA_JSESSIONID, which properly closes the double quote. Other than that, your sed expression works for me, and the above command outputs the following result:
JAVA_OPTS="$JAVA_OPTS -Dorg.apache.catalina.jsessionid=FOO/BAR/FOOBAR"
You can use sed like this:
sed -r '/\$JAVA_OPTS -D/{s/^(.+=).*$/\1'"$DORG_APACHE_CATALINA_JSESSIONID"'/;}' $JBOSS_CONFIGURATION/jboss.configuration
You can specify a pattern that'll match the desired string rather than trying to specify it exactly.
The following should work for you:
sed -i 's#^\(JAVA_OPTS.*Dorg.apache.catalina.jsessionid\)=\([^"]*\)"#\1='"$DORG_APACHE_CATALINA_JSESSIONID"'"#' $JBOSS_CONFIGURATION/jboss.configuration
sed 's/=\w.*$/='"$DORG_APACHE_CATALINA_JSESSIONID"'/' $JBOSS_CONFIGURATION/jboss.configuration
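The answers above either escape the slashes or switch the delimiter. A minimal sketch that does both jobs in a reusable way follows; the function names escape_sed_replacement and replace_jsessionid are hypothetical, and the sketch assumes the value contains no newlines. It uses | as the delimiter and escapes the characters that are special in a sed replacement (&, \ and the delimiter itself).

```shell
#!/bin/bash
# Hypothetical helper: escape &, \ and | so the value can be used safely
# on the replacement side of a sed s|...|...| command.
escape_sed_replacement() {
    printf '%s' "$1" | sed 's/[&|\\]/\\&/g'
}

# Hypothetical helper: replace the jsessionid value on stdin with $1.
# The pattern is single-quoted so the shell leaves $JAVA_OPTS alone;
# only the escaped value is spliced in between the quoted parts.
replace_jsessionid() {
    local value
    value=$(escape_sed_replacement "$1")
    sed 's|^\(JAVA_OPTS="\$JAVA_OPTS[[:space:]]*-Dorg\.apache\.catalina\.jsessionid=\).*$|\1'"$value"'"|'
}
```

To edit the file in place, the same expression could be passed to sed -i with "$JBOSS_CONFIGURATION/jboss.configuration" as the target instead of reading stdin.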

Find and replace Non breaking space characters in Bash

I have a document with some special characters like non-breaking space, non-breaking hyphen, and so on. I want to normalize this document and replace these special characters with space. In addition since the content of this document is gathered from different resources, I have different forms of "Yeh" (ی) in it, and I want to normalize them.
Is it possible to find and replace Unicode characters in a document using the sed command? Can I use Unicode code points instead of the literal form of the character? For example, can I use x00a0 instead of a non-breaking space in a sed command? How?
Sorry for bad explanation.
My documents are encoded in UTF-8 and contain non-English characters. For example, I have a document in Arabic, a document in Urdu, and one in Persian (Farsi). Now I want to replace some of the characters in these files with other characters.
By normalizing, I mean that I want to replace all forms of "Yeh" with one form. (As you might know, there are many forms of this character used in Arabic, but for simplification and some processing issues I want to unify all these forms.)
To process UTF-8 files, you have to parse each character from beginning to end. If you need to do it efficiently, you have to write a real program rather than trying to script a solution.
If you just want to script it, it is easier to convert the text to UTF-16 and then process the characters.
A fairly inefficient way would be:
#!/bin/bash
function px {
local a="$*"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
px $utf16header
out=''
while read line
do
if [ "$line" == "000a" ]
then
out=$out$line
px $out
out=''
else
# put your conversion logic here.
# e.g
# if [ "$line" == "0031" ] ; then
# line="0041"
# fi
out=$out$line
fi
done
fi | iconv -f UTF16 -t UTF8
This might work for you (GNU sed):
echo abcd | sed 'p;y/\x61\x62\x63/ABC/'
abcd
ABCd
