Find and replace Non breaking space characters in Bash - linux

I have a document with some special characters like non-breaking space, non-breaking hyphen, and so on. I want to normalize this document and replace these special characters with space. In addition since the content of this document is gathered from different resources, I have different forms of "Yeh" (ی) in it, and I want to normalize them.
Is it possible to find and replace unicode characters in a document using sed command? Can I use Unicode codes instead of surface form of the character? for example can I use x00a0 instead of non-breaking space in sed command? How?
Sorry for bad explanation.
My documents are encoded in UTF8, and contain non-English characters. for example I have a document in Arabic, a document in Urdu, and one in Persian (Farsi). now I want to replace some of the characters in these files by another character.
By normalizing, I mean that I want to replace all forms of "Yeh" into one form. (As you might now, there are many forms of this character which is used in Arabic, but for simplification and some processing issues I want to unify all these forms.

To process UTF-8 files, you have to parse each characters from begin to end. If you need to do it efficiently, you have to write a real program rather then trying to script a solution.
If you just want to script it, it is easier to convert it to UTF-16 and then process the characters.
A fairly inefficient way would be:
#!/bin/bash
function px {
local a="$#"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
px $utf16header
out=''
while read line
do
if [ "$line" == "000a" ]
then
out=$out$line
px $out
out=''
else
# put your coversion logic here.
# e.g
# if [ "$line" == "0031" ] ; then
# line="0041"
# fi
out=$out$line
fi
done
fi | iconv -f UTF16 -t UTF8

This might work for you (GNU sed):
echo abcd | sed 'p;y/\x61\x62\x63/ABC/'
abcd
ABCd

Related

How do I add the first 2 letters of every line in a file to a list using bash?

I have a file ($ScriptName). I want the first 2 charactors of every line to be in a list (Starters). I am using a bash script.
How would I do this?
I have declared my array like this:
array=() #Empty array
Using guidence from this: https://opensource.com/article/18/5/you-dont-know-bash-intro-bash-arrays
I am using manjaro 19 and the latest kernel.
To get the first two characters from each line, you can use
cut -c1,2 "$ScriptName"
-c1,2 means "output characters in positions 1 and 2"
I'm not sure what you mean by a "list". If you just want to create a file with the results, use redirection:
cut -c1,2 "$ScriptName" > Starters
If you want to populate an array, just use
while IFS= read -r starter ; do Starters+=("$starter") ; done < <(cut -c1,2 "$ScriptName")
Moreover, if you're interested in letters rather than characters, you can use sed to remove non-letters from each line and then use the solution shown above.
sed 's/[^[:alpha:]]//g' "$ScriptName" | cut -c1,2
Try this Shellcheck-clean (except for a missing initialization of ScriptName) pure Bash code:
Starters=()
while IFS= read -r line || [[ -n $line ]]; do
Starters+=( "${line:0:2}" )
done < "$ScriptName"
See Arrays [Bash Hackers Wiki] for information about using arrays in Bash.
See BashFAQ/001 (How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?)
for information about reading files line-by-line in Bash.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) (particularly the bit about "range notation") for an explanation of ${line:0:2}".
The mapfile bash built-in command combined with cut makes it simple:
cut -c1,2 "$ScriptName" | mapfile Starters

Remove lines with japanese characters from a file

First question on here- I've searched around to put together an answer to this but have come up empty thus far.
I have a multi-line text file that I am cleaning up. Part of this is to remove lines that include Japanese characters. I have been using sed for my other operations but it is not working in this instance.
I was under the impression that using the -r switch and the \p{Han} regular expression would work (from looking at other questions of this kind), but it is not working in this case.
Here is my test string - running this returns the full string, and does not filter out the JP characters as I was expecting.
echo 80岁返老还童的处女: 第3话 | sed -r "s/\\p\{Han\}//g"
Am I missing something? Is there another command I should be using instead?
I think this might work for you:
echo "80岁返老还童的处女: 第3话" | tr -cd '[:print:]\n'
sed doesn't support unicode classes AFAIK, and nor support multibyte ranges.
-d deletes characters in SET1, and -c reverses it.
[:print:] matches all printable characters including space.
\n is a newline
The above will not only remove Japanese characters but all multibyte characters, including control characters.
Perl can also be used:
PERLIO=:utf8 perl -pe 's/\p{Han}//g' file
PERLIO=:utf8 tells Perl to tread input and output as UTF-8

Change all non-ascii chars to ascii Bash Scripting

I am trying to write a script that take people names as an arguments and create a folder with their names. But in folder names, the non-ascii chars and whitespaces can sometimes make problem so I want to remove or change them to ascii chars.
I can remove the whitespace between name and surname but I can not figure out how can I change ş->s, ç->c, ğ->g, ı->i, ö->o.
Here is my code :
#!/bin/bash
ARRAY=("$#")
ELEMENTS=${#ARRAY[#]}
for (( i=0;i<$ELEMENTS;i++))
do #C-like for loop syntax
echo ${ARRAY[$i]} | grep "[^ ]*\b" | tr -d ' '
done
I run my script like that myscript.sh 'Çişil Aksoy' 'Cem Dalgıç'
It should change the arguments like : CisilAksoy CemDalgic
Thanks in advance
EDIT :
I found this solution, this does not look very pretty but it works.
sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;'
EDIT2 : SOLVED
#!/bin/bash
ARRAY=("$#")
ELEMENTS=${#ARRAY[#]}
for (( i=0;i<$ELEMENTS;i++))
do #C-like for loop syntax
v=$(echo ${ARRAY[$i]} | grep "[^ ]*\b" | tr -d ' ' | sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;')
mkdir $v
done
Anything that converts from UTF-8 to ASCII is going to be a compromise.
The iconv program does what was requested (not necessarily satisfying everyone, as in Transliterate any convertible utf8 char into ascii equivalent). Given
Çişil Aksoy' 'Cem Dalgıç
in "foo.txt", and the command
iconv -f UTF8 -t ASCII//TRANSLIT <foo.txt
that would give
Cisil Aksoy' 'Cem Dalg?c
The lynx browser has a different set of ASCII approximations. Using this command
lynx -display_charset=us-ascii -force_html -nolist -dump foo.txt
I get this result:
C,isil Aksoy' 'Cem Dalgic,
Simply put, you can't. ASCII only supports 128 characters.
International characters typically use some variation of Unicode, which can store a much much greater number of characters.
I think your best bet is to identify WHY your folder creation fails when using these characters. Does the method or function not support Unicode? If it does, figure out how to specify that instead of ASCII. If not, you might be stuck with sed and/or tr, which is probably not sustainable.
[UPDATED]
You should be able to substitute multiple characters via tr like follows:
echo şğıö | tr şçğıö scgio
sgio
(I removed my comment from earlier. I tried it on a different server and it worked fine.)

Bash separation of line with newlines instead of spaces

I did two following commands:
With the first one I listed content of directory and stored it in variable.
Second one shows content of variable.
Now I decided that I want to separate listing not with spaces but with newlines, I do the following:
I get a mess. Why?
It's worth to note that when I changed command so, it worked as I wanted:
Could someone please explain, why 0x20 or 32 ( I tried this number too) is not treated Bash as space in this case?
tr simply doesn't recognize hex but octal. This would work:
tr '\040' '\n'
And the easier way to show your files is
shopt -s nullglob ## Optional.
printf '%s\n' *
The problem with tr '\0x20' is, tr is treating all the character sequence as literal characters. And the characters are 0, x, 2. Note all of theese characters were replaced in the output by \n. That's why you have .t instead of txt. Also 2 didn't appear too.
This is not bash, its tr which is making you unhappy. If you really want to iterate over file names there are better ways to do that.
for f in *; do
# do work with $f. But always use quotes. Like `"$f"`
done

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by #TMS in his answer, as edited by #James. The last command differs from #TMS comment in that it is both correct and has been tested. The explanation is:
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used. / is
conventional, but inconvenient for processing, say, XML or path names.
/ or the alternate character you chose, closes the regular expression and
opens the substitution string.
In / */ the * matches any sequence of the previous character (in this
case, a space).
/ or the alternate character you chose, closes the substitution string.
In this case, the substitution string // is empty, i.e. the match is
deleted.
g is the option to do this substitution globally on each line instead
of just once for each line.
The quotes keep the command parser from getting confused - the whole
sequence is passed to sed as the first option, namely, a sed script.
#TMS brain child (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line - 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
sed is uniquely retarded when dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option sed interprets the entire file as one line (unless there are stray nulls embedded in the file), terminated by a null and so this also chokes on large files.
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
Simple pipe implementation uses an intermediate temporary file and all input is processed and fed to the file. The next command may be running in another thread, concurrently reading that file, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries feeding the file. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
echo hello | hexdump -v -e '/1 "%02X "'
Playing around with this further,
A working solution is to remove the "*", it is unnecessary for both the original requirement to simply remove spaces as well if substituting an actual character is desired, as follows
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So, I consider this as an improvement answering the original Q since the statement now does exactly what is required, not just apparently.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
#echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must follow immediately after the string or the Windows echo will add that space after the string.
There is no '-n' switch to the Windows echo command.

Resources