bash and awk extract string at specific position in non-utf file - linux

I have a file foo.dat that is encoded with charset ISO-8859-1.
I am doing some field extraction with awk, based on a specific position.
E.g., at each line, extract a string that starts at position 10 with length 5.
That is a simple task; however, the command below behaves differently on different Linux machines (with different bash/awk versions).
In Machine 1 OK, Machine 2 NOT ok:
cat foo.dat | iconv -f ISO-8859-1 -t UTF-8 | awk '{print substr($0, 10,5)}' > results.utf8
In Machine 1 NOT ok, Machine 2 OK:
cat foo.dat | awk '{print substr($0, 10,5)}' | iconv -f ISO-8859-1 -t UTF-8 > results.utf8
If I run the same commands with the same input file, the results differ on each line that contains a "non-UTF" char (like a▒c) before the 'cut' position.
No idea where the issue is: Linux kernel, bash, or awk version... and especially how to have a common way to extract the desired strings...

No idea where the issue is: Linux kernel, bash, or awk version...
The GNU Awk User's Guide - Bytes vs. Characters claims that
The POSIX standard requires that awk function in terms of characters,
not bytes. Thus in gawk, length(), substr(), split(),
match() and the other string functions (...) all work in terms of
characters in the local character set, and not in terms of bytes. (Not
all awk implementations do so, though).
If the above holds true, then the answer to how to have a common way to extract the desired strings is to use an awk implementation that is POSIX-compliant (or at least one that respects the rule above and works in terms of characters, not bytes) and to make sure the locale's character set is as desired.
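As a concrete sketch of that advice (assuming gawk and an available UTF-8 locale; the locale name en_US.UTF-8 is illustrative), convert first and then let a character-aware awk do the extraction:
# decode to UTF-8, then count characters (not bytes) in a UTF-8 locale
iconv -f ISO-8859-1 -t UTF-8 foo.dat | LC_ALL=en_US.UTF-8 gawk '{print substr($0, 10, 5)}' > results.utf8
Here substr() counts characters of the decoded stream, so a multibyte sequence before position 10 no longer shifts the cut.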

One option is to use a language which only has one implementation and where you can turn off UTF-8 (or rather, fail to turn it on).
It's not entirely clear what you expect the output to be, but I'm guessing you want something like this:
perl -lne 'print substr($_, 9, 5)' foo.dat | iconv -f ISO-8859-1 -t UTF-8
Notice how the conversion only happens after the extraction, so you can be sure that each byte is exactly one character.
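If Perl is not available, cut -b offers the same byte-wise guarantee, since -b is the POSIX byte-selection option; bytes 10-14 are exactly the five single-byte ISO-8859-1 positions requested:
cut -b 10-14 foo.dat | iconv -f ISO-8859-1 -t UTF-8 > results.utf8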

Related

How to truncate rest of the text in a file after finding a specific text pattern, in unix?

I have an HTML page which I extracted on Unix using the wget command. After the word "Check list" I need to remove all of the text, and with the remainder I am trying to grep some data. I am unable to think of a way to remove the text after a keyword. If I do
s/Check list.*//g
it just removes the rest of that line; I want everything below it to be gone as well. How do I do this?
The other solutions you have so far require non-POSIX-mandatory tools (GNU sed, GNU awk, or perl), so YMMV with their availability, and they will read the whole file into memory at once.
These will work in any awk in any shell on every Unix box and only read 1 line at a time into memory:
awk -F 'Check list' '{print $1} NF>1{exit}' file
or:
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file
With GNU awk for multi-char RS you could do:
awk -v RS='Check list' '{print; exit}' file
but that would still read all of the text before Check list into memory at once.
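For illustration, feeding a hypothetical three-line input to the first of the portable commands prints everything up to the keyword and then stops reading (the second output line is the text before the keyword, trailing space included):
printf 'head\nfoo Check list bar\ntail\n' | awk -F 'Check list' '{print $1} NF>1{exit}'
head
foo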
Depending on which sed version you have, maybe
sed -z 's/Check list.*//'
The /g flag is useless, as you only want to perform the replacement once.
If your sed does not have the -z option (which says to use the ASCII null character as line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:
perl -0777 -pe 's/Check list.*//s'
Unlike sed -z, this explicitly says to slurp the entire file into memory (the argument to -0 is the octal character code of a terminator character, but 777 is not a valid terminator character at all, so it always reads the entire file as a single "line") so this works even if there are spurious nulls in your file. The final s flag says to include newline in what . matches (otherwise s/.*// would still only substitute on the matching physical line).
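A quick sanity check on a hypothetical three-line input shows only the text before the keyword surviving:
printf 'head\nCheck list\ntail\n' | perl -0777 -pe 's/Check list.*//s'
head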
I assume you are aware that removing everything will violate the integrity of the HTML file; every start tag near the beginning of the document needs a corresponding closing tag (so if it starts with <html><body> you should keep </body></html> just before the end of the file, for example).
With awk you could make use of the RS variable, set the field separator to a regex with word boundaries, and then print the very first field:
awk -v RS="^$" -v FS='\\<Check list\\>' '{print $1}' Input_file
You might use q to instruct GNU sed to quit, thus ending processing. Consider the following simple example: let file.txt content be
123
456
789
and say you want to jettison everything beyond 5, then you could do
sed '/5/{s/5.*//;q}' file.txt
which gives output
123
4
Explanation: for the line containing 5, substitute 5 and everything beyond it with the empty string (i.e. delete it), then q. Observe that lowercase q is used so that the altered line is printed before quitting.
(tested in GNU sed 4.7)

sed doesn't remove characters from UTF range properly

I want to clear my file of all characters except Russian and Arabic letters, "|" and the space mark. Let's start with only Arabic letters. So I have:
cat file.txt | sed 's/[^\u0600-\u06FF]//g'
sed: -e expression #1, char 21: Invalid range end.
I have tried [\u0621-\u064A] - same.
I also tried to use {Arabic}, but it doesn't clean files properly at all.
The error looks kinda strange to me. Obviously, 064A > 0621.
So, overall I want to have something like this:
cat file.txt | sed 's/[^\u0600-\u06FFа-яА-Я |]//g'
And I am OK with awk or any other utility, but as far as I know sed is stable and reliable.
Perl understands UTF-8:
perl -CSD -pe 's/[^\N{U+0600}-\N{U+06FF}]//g' -- file.txt
-C turns on UTF-8 support: S means for stdin/stdout/stderr, D means for any i/o streams.
You can also use Unicode properties:
s/\P{Cyrillic}//g
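Putting the two together for the full goal stated in the question (keep Arabic, Cyrillic, "|" and space), a sketch using Unicode script properties; \n is kept in the class so line breaks survive, and | is literal inside a character class:
perl -CSD -pe 's/[^\p{Arabic}\p{Cyrillic}| \n]//g' -- file.txt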

Change all non-ascii chars to ascii Bash Scripting

I am trying to write a script that takes people's names as arguments and creates a folder for each name. But in folder names, non-ASCII chars and whitespace can sometimes cause problems, so I want to remove them or change them to ASCII chars.
I can remove the whitespace between name and surname, but I cannot figure out how to change ş->s, ç->c, ğ->g, ı->i, ö->o.
Here is my code :
#!/bin/bash
ARRAY=("$#")
ELEMENTS=${#ARRAY[#]}
for (( i=0;i<$ELEMENTS;i++))
do #C-like for loop syntax
echo ${ARRAY[$i]} | grep "[^ ]*\b" | tr -d ' '
done
I run my script like this: myscript.sh 'Çişil Aksoy' 'Cem Dalgıç'
It should change the arguments like : CisilAksoy CemDalgic
Thanks in advance
EDIT :
I found this solution, this does not look very pretty but it works.
sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;'
EDIT2 : SOLVED
#!/bin/bash
ARRAY=("$#")
ELEMENTS=${#ARRAY[#]}
for (( i=0;i<$ELEMENTS;i++))
do #C-like for loop syntax
v=$(echo ${ARRAY[$i]} | grep "[^ ]*\b" | tr -d ' ' | sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI;')
mkdir $v
done
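For reference, the same loop can be written without the ARRAY bookkeeping by iterating over "$@" directly. A sketch, which also adds the ğ->g rule that the question lists but the sed script above omits:
#!/bin/bash
for name in "$@"; do
    # strip spaces, then transliterate the Turkish letters (GNU sed, as in the EDIT above)
    v=$(printf '%s' "$name" | tr -d ' ' | sed 's/ş/s/gI; s/ç/c/gI; s/ü/u/gI; s/ö/o/gI; s/ı/i/gI; s/ğ/g/gI')
    mkdir -- "$v"
done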
Anything that converts from UTF-8 to ASCII is going to be a compromise.
The iconv program does what was requested (not necessarily satisfying everyone, as in Transliterate any convertible utf8 char into ascii equivalent). Given
Çişil Aksoy' 'Cem Dalgıç
in "foo.txt", and the command
iconv -f UTF8 -t ASCII//TRANSLIT <foo.txt
that would give
Cisil Aksoy' 'Cem Dalg?c
The lynx browser has a different set of ASCII approximations. Using this command
lynx -display_charset=us-ascii -force_html -nolist -dump foo.txt
I get this result:
C,isil Aksoy' 'Cem Dalgic,
Simply put, you can't. ASCII only supports 128 characters.
International characters typically use some variation of Unicode, which can store a much much greater number of characters.
I think your best bet is to identify WHY your folder creation fails when using these characters. Does the method or function not support Unicode? If it does, figure out how to specify that instead of ASCII. If not, you might be stuck with sed and/or tr, which is probably not sustainable.
[UPDATED]
You should be able to substitute multiple characters via tr like follows:
echo şğıö | tr şçğıö scgio
sgio
(I removed my comment from earlier. I tried it on a different server and it worked fine.)
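If a particular tr turns out to be byte-oriented and mangles the multibyte letters, GNU sed's y (transliterate) command is a character-aware alternative; a sketch, assuming GNU sed running in a UTF-8 locale:
echo şğıö | sed 'y/şçğıö/scgio/'
sgio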

sed returning different result on different platforms

Hi, using the following command on an x86 machine (using /bin/sh) returns <port>3</port>:
test="port 3"
echo $test | sed -r 's/\s*port\s*([0-9]+)\s*/<port>\1<\/port>/'
but running the same command in the sh shell of an ARM-based network switch returns the string port 3.
How can I get the same result on the switch as I got on my x86 machine? To me it seems like the digit is not being captured by [0-9].
\s is a GNU sed extension to the standard sed behavior. GNU sed is the implementation on desktop/server Linux systems. Most embedded Linux systems run BusyBox, a suite of utilities with a markedly smaller footprint and fewer features.
A standard way of specifying “any space character” is the [:space:] character class. It is supported by BusyBox (at least, by most BusyBox installations; most BusyBox features can be stripped off for an even lower footprint).
BusyBox also doesn't support the -r option, so you need to use a basic regular expression. In a BRE, \(…\) marks groups, and there is no + operator, only *.
echo "$test" | sed 's/[[:space:]]*port[[:space:]]*\([0-9][0-9]*\)[[:space:]]*/<port>\1<\/port>/'
Note that since you didn't put any quotes around $test, the shell performed word splitting and wildcard expansion on the value of the variable. That is, the value of the variable was treated as a whitespace-separated list of file names which were then joined by a single space. So if you leave out the quotes, you don't have to worry about different kinds of whitespace, you can write echo $test | sed 's/ *port *([0-9][0-9]*) */<port>\1<\/port>/'. However, if $test had been port *, the result would have depended on what files exist in the current directory.
Not all seds support regular-expression shorthand like \s. A more portable version is
test="port 3"
echo "$test" | sed -r 's/[ ]*port[ ]*([0-9]+)[ ]*/<port>\1<\/port>/'
If you really need to check for tab chars as well, just add them to the char classes (in all 3 places) that, in my example, contain just a space char, i.e. the [ ] bits.
output
<port>3</port>
I hope this helps.

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
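Running it on the example string gives the requested byte values (od indents the line and separates the bytes with spaces):
echo -n "Hello" | od -A n -t x1
 48 65 6c 6c 6f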
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by @TMS in his answer, as edited by @James. The last command differs from @TMS's comment in that it is both correct and has been tested. The explanation is:
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used. / is
conventional, but inconvenient for processing, say, XML or path names.
/ or the alternate character you chose, closes the regular expression and
opens the substitution string.
In / */ the * matches any sequence of the previous character (in this
case, a space).
/ or the alternate character you chose, closes the substitution string.
In this case, the substitution string // is empty, i.e. the match is
deleted.
g is the option to do this substitution globally on each line instead
of just once for each line.
The quotes keep the command parser from getting confused - the whole
sequence is passed to sed as the first option, namely, a sed script.
@TMS's brainchild (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line - 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
sed is uniquely ill-suited to dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option, sed interprets the entire file as one line terminated by a null (unless there are stray nulls embedded in the file), and so this also chokes on large files.
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
A simple pipe implementation uses an intermediate temporary file: all input is processed and fed to the file. The next command may be running in another thread, concurrently reading that file, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries that fed the file. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
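Putting that together, a newline-free variant of the whole pipeline keeps sed for the spaces but hands the newline to tr, as described above:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g' | tr -d '\n'
48656c6c6f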
echo hello | hexdump -v -e '/1 "%02X "'
Playing around with this further, a working solution is to remove the "*"; it is unnecessary both for the original requirement of simply removing spaces and when substituting an actual character is desired, as follows:
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So I consider this an improvement on the original answer, since the statement now does exactly what is required, not just apparently.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
#echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must follow immediately after the string or the Windows echo will add that space after the string.
There is no '-n' switch to the Windows echo command.
