Removing extended ASCII characters (128-255) in a Linux script

I want to remove from my text every character whose code is in the interval [128-255].
gsub(/[^a-z]/, "", $0) -- This is how I remove everything except the letters.
gsub(/ē|é|ě|è|ū|ú|ǔ|ù|ǖ|ǘ|ǚ|ǜ|ü|ō|ó|ǒ|ò|ī|í|ǐ|ì|ā|á|ǎ|à|å|ä|â/, "", $0) -- This is how I remove some extended characters, but not all of them.
gsub(/"[\128-\255]"/, "", $0) -- This is what I am trying, but it gives me an "invalid interval" error. Can anybody help with this problem? Thanks in advance.

Backslash codes must be in octal, or prefixed with an x and written in hexadecimal.
\200-\377
\x80-\xff
Or you could just use strings.

The \nnn syntax is octal (where n is 0-7), so:
\128 = invalid octal
\200 = 128
\255 = 173
\377 = 255
So you want:
\200-\377
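As a rough cross-check of the octal/hex arithmetic above, here is a minimal Python sketch (not awk) that strips the same byte range. The name strip_high_bytes is made up for illustration, and the sketch works on raw bytes, so each byte of a multi-byte UTF-8 character is removed individually.

```python
import re

# Octal \200-\377 and hex \x80-\xff name the same byte values.
print(int("200", 8), int("377", 8))   # 128 255

# Character class covering every byte from 128 to 255.
HIGH_BYTES = re.compile(rb"[\x80-\xff]")

def strip_high_bytes(data: bytes) -> bytes:
    """Return data with every byte in the range 128-255 removed."""
    return HIGH_BYTES.sub(b"", data)

print(strip_high_bytes("café ü".encode("latin-1")))   # b'caf '
```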

Printing a string containing utf-8 byte sequences on perl

I'm new to perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í
I tried using a regex to replace each unescaped $ with \x:
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape \ $ )
Within perl it just prints the literal string.
My workaround is feeding it into bash's printf and it succeeds... unless there's a literal "\x" in the string
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions I consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in perl?
The following substitution seems to work for your input:
s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;
This captures \$ or \\ and, if one of them matched, replaces it with $ or \. Otherwise, it captures the two characters after $ and converts them to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8 (decode comes from the Encode module).
use Encode qw(decode);
$chars = decode('UTF-8', $bytes);
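The same two-step idea (turn the escapes into raw bytes first, then decode the whole byte string as UTF-8) can be sketched in Python. This is only an illustration under the assumptions stated in the question: \$, \) and \\ are the backslash escapes, and $XX is a two-digit hex byte. The function name decode_mork_value is hypothetical.

```python
import re

# Either a backslash escape (\$, \), \\) or a $XX hex byte.
_TOKEN = re.compile(rb"\\([$)\\])|\$([0-9A-Fa-f]{2})")

def decode_mork_value(raw: bytes) -> str:
    """Expand the escapes to raw bytes, then decode the result as UTF-8."""
    def expand(match):
        if match.group(1) is not None:
            # Escaped literal: keep the character after the backslash.
            return match.group(1)
        # $XX: emit the single byte with that hex value.
        return bytes([int(match.group(2), 16)])
    return _TOKEN.sub(expand, raw).decode("utf-8")

# Prints aaa$AAñb☺í\x08, where the final \x08 is four literal characters.
print(decode_mork_value(rb"aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08"))
```

Decoding only after all bytes are assembled is what lets multi-byte sequences such as $E2$98$BA come out as one character instead of three.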

Enter Key string.byte number in Lua

I use this to detect space in a string in Lua:
if string.byte(" ")==32 then blah blah
What is the return number (instead of 32) for enter key or new line in Lua?
These numbers are the ASCII codes for each character, so newline is 10. An ASCII chart is a handy future reference, but it only goes up to 127, as extended ASCII is not supported.
You can also print a list with the following code:
for i=1,127 do
print(i .. " = " .. string.char(i))
end
However, control characters (such as newline) are difficult to interpret in that output.
You can check them with the \n and \r characters.
> =string.byte '\r'
13
> =string.byte '\n'
10
I don't know the number offhand, but you can find it with print(string.byte("\n"))

New to Perl and was wondering why my code isn't doing what it's supposed to

I have an assignment asking me to print x iterations of a string for each character in that string. So if the string input is "Gum", then it should print out:
Gum
Gum
Gum
Right now my code is
my $string = <>;
my $length = length($string);
print ($string x $length, "\n");
And I'm getting gum printed five times as my output.
Those who have said you will get CR + LF at the end of the line on a Windows system are mistaken. Perl will convert the native line ending to a simple newline \n on any platform.
You must bear this in mind whether you are reading from the terminal or from a file.
The built-in chomp function will remove the line terminator character from the end of a string variable. If the string doesn't end with a line terminator then it will have no effect.
So when you type Gum followed by Enter, you are setting $string to "Gum\n", and length will show that it has four characters.
You are seeing it five times on your screen because the first line is what you typed in yourself. The following four are printed by the program.
After a chomp, $string is just "Gum" with a length of three characters, which is what you want.
To output this on separate lines you have to print a newline after each line, so you can write
my $string = <>;
chomp $string;
my $length = length $string;
print ("$string\n" x $length);
or perhaps
print $string, "\n" for 1 .. $length;
I hope that helps
As you are simply using the input string, it still contains the newline at the end. This is also counted as a character. On my system, it prints Gum four times, followed by \n.
chomp($string) will remove the line ending, but the output will then also run together, resulting in GumGumGum\n
When you type your input and press Enter, you aren't entering "Gum" but "Gum\r\n", which is a string of length 5. You should trim it.
Your code is working fine. See this: http://ideone.com/AsPFh3
Possibility 1: You might be typing 2 spaces after the input on the command line, which is why the length comes out as 5 and it prints 5 times. Something like this: http://ideone.com/fsvnrd
In that case my $string=<>; will give you my $string = "gum "; so the length will be 5.
Possibility 2:
If you use Windows, it will add a carriage return (\r) and a newline (\n, from pressing Enter) at the end of the string, which makes the length 5.
Edit: To print each copy on a new line, use the code below.
#!/usr/bin/perl
# your code goes here
chomp(my $string=<>);
my $length = length($string);
print ("$string\n" x $length);
Edit 2: To remove \r\n use the following:
$string =~ s/\r|\n//g;
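For comparison, the strip-the-line-ending-then-repeat approach from the answers above can be sketched in Python. The function name repeat_per_char is hypothetical; rstrip("\r\n") stands in for chomp here and additionally drops a stray "\r" if one is present.

```python
def repeat_per_char(line: str) -> str:
    """Return the word once per character it contains, one copy per line."""
    # The line read from input still ends with its terminator, so
    # len("Gum\n") is 4, not 3. Strip it before counting.
    word = line.rstrip("\r\n")
    return (word + "\n") * len(word)

print(repeat_per_char("Gum\n"), end="")
```

With the input "Gum\n" this prints Gum on three separate lines.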

Remove escapes from a string, or, "how can I get \ out of the way?"

Escape characters cause a lot of trouble in R, as evidenced by previous questions:
Change the values in a column
Can R paste() output "\"?
Replacing escaped double quotes by double quotes in R
How to gsub('%', '\%', ... in R?
Many of these previous questions could be simplified to special cases of "How can I get \ out of my way?"
Is there a simple way to do this?
For example, I can find no arguments to gsub that will remove all escapes from the following:
test <- c('\01', '\\001')
The difficulty here is that "\1", although it's printed with two glyphs, is actually, in R's view, a single character. In fact, it's the very same character as "\001" and "\01":
nchar("\1")
# [1] 1
nchar("\001")
# [1] 1
identical("\1", "\001")
# [1] TRUE
So, you can in general remove all backslashes with something like this:
(test <- c("\\hi\\", "\n", "\t", "\\1", "\1", "\01", "\001"))
# [1] "\\hi\\" "\n" "\t" "\\1" "\001" "\001" "\001"
eval(parse(text=gsub("\\", "", deparse(test), fixed=TRUE)))
# [1] "hi" "n" "t" "1" "001" "001" "001"
But, as you can see, "\1", "\01", and "\001" will all be rendered as 001 (since to R they are all just different names for "\001").
EDIT: For more on the use of "\" in escape sequences, and on the great variety of characters that can be represented using them (including the disallowed nul string mentioned by Joshua Ulrich in a comment above), see this section of the R language definition.
I just faced the same issue. If you want to handle any \x where x is a character, I am not sure how (I wish I knew), but to fix it for a specific escape sequence, say \n, you can do
new = gsub("\n", "", old, fixed=TRUE)
In my case, I only had \n.

Unicode-aware strings(1) program

Does anybody have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I want something that essentially does the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English-language characters and punctuation. (I only care about English characters, not any other alphabet.)
Thanks!
Do you just want to use it, or do you for some reason insist on the code?
On my Debian system, it seems the strings command can do this out of the box. See this excerpt from the manpage:
--encoding=encoding
Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859,
etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
for finding wide character strings.
Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.
byte b;
int i = 0;
while (!endOfInput()) {
    b = getNextByte();
LoopBegin:
    if (!isEnglish(b)) {
        if (i > 0) {
            // report successful match of length i
        }
        i = 0;
        continue;
    }
    if (endOfInput()) break;
    if ((b = getNextByte()) != 0)
        goto LoopBegin;
    i++; // found another character
}
This should work for little-endian.
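The "alternating English characters and zero bytes" idea can also be sketched with regular expressions in Python. This is a minimal illustration, not a replacement for strings(1): the function name find_strings and the default minimum length of 4 are assumptions, "English" is approximated as printable ASCII (0x20-0x7e), and only single-byte and UTF-16 little-endian runs are handled.

```python
import re

def find_strings(data: bytes, min_len: int = 4) -> list:
    """Return printable-ASCII runs stored as single bytes or as UTF-16LE."""
    n = str(min_len).encode("ascii")
    # Runs of printable ASCII bytes (also matches ASCII text inside UTF-8).
    single = re.compile(rb"[\x20-\x7e]{" + n + rb",}")
    # Runs of printable ASCII bytes each followed by a zero byte (UTF-16LE).
    wide = re.compile(rb"(?:[\x20-\x7e]\x00){" + n + rb",}")
    found = [m.group().decode("ascii") for m in single.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide.finditer(data)]
    return found

sample = b"\x01\x02hello\x00\xff" + "world".encode("utf-16-le") + b"\x00\x00\x03ab"
print(find_strings(sample))   # ['hello', 'world']
```

For big-endian UTF-16 the zero byte would come before each character instead of after it.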
I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings. (UTF-8 is a variable-width encoding.)
Remember that, by default, characters outside ASCII need extra strings options. This includes almost all non-English-language strings.
Nevertheless, the output of "-e S" (single 8-bit characters) includes UTF-8 characters.
I wrote a very simple (opinionated) Perl script that applies
"strings -e S ... | iconv ..." to the input files.
I believe it is easy to tune it for specific restrictions.
Usage: utf8strings [options] file*
#!/usr/bin/perl -s
our ($all,$windows,$enc);   ## use -all to ignore the "3 letters word" restriction
use strict;
use utf8::all;
$enc = "ms-ansi" if $windows;   ## -windows selects the ms-ansi encoding
$enc = "utf8" unless $enc;      ## default encoding=utf8
my $iconv = "iconv -c -f $enc -t utf8 |";
for (@ARGV){ s/(.*)/strings -e S '$1'| $iconv/; }   # turn each file name into a pipe command read by <>
my $word=qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i;   # adapt this to your case
while(<>){
    # next if /regular expressions for common garbage/;
    print if ($all or /$word/);
}
In some situations, this approach produces some extra garbage.
