Interpolating ASCII with utf8 gives error in open() - string

As stated in the title, the problem seems to be that I have one string read from an ASCII file, and another that is utf8; when I use interpolation to form a string, and then pass that string to open(), it seems to get munged, and I get an error. Here is a minimal example:
#!/usr/bin/perl
use open ":encoding(utf8)";
use strict;
open (FILE,"<u");
my $p = <FILE>;
$p =~ s/\s+$//;
close FILE;
print "p=",$p,"\n";
if ($p eq "cat") {print "yes\n"} else {print "no\n"}
my $file = "påminnelser"; # note the circle over the "a"
my $x = "$p <$file |";
print "x=$x\n";
open (FILE, $x);
close FILE;
It seems to make a difference that the string $p is read from the external file u, which looks like this:
cat
My code is utf8, while file u is ASCII, according to the 'file' utility:
---- rintintin a $ file u
u: ASCII text
---- rintintin a $ file bug.pl
bug.pl: Perl script, UTF-8 Unicode text executable
The result looks like this:
---- rintintin a $ ./bug.pl
p=cat
yes
x=cat <påminnelser |
sh: 1: cannot open påminnelser: No such file
The filename has been munged somewhere inside the call to open(). Although $p eq "cat" is true, if I simply set $p="cat" in the code rather than reading it from the file, the error goes away. I would guess that this is because my source code file is utf8.
Can anyone explain what is happening here and how to fix it?
[EDIT] As described in my comment on Dmitri Chubarov's answer, it turns out that my minimal example actually didn't correctly represent the bug in my original program. This question describes the actual bug: Should perl's File::Glob always be post-filtered through utf8::decode?

You should add the
use utf8;
pragma to your script so that the Perl source text is interpreted as UTF-8.
By default Perl source is interpreted as a stream of bytes, so the literal in
my $file = "påminnelser";
is stored as a string of bytes rather than characters, and those bytes are later interpreted according to the default encoding.
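One plausible way the munging can arise, sketched from the shell rather than in Perl: if the two UTF-8 bytes of "å" are taken as two separate Latin-1 characters and then encoded to UTF-8 a second time, the name that reaches sh no longer matches the file on disk. The helper names hex_bytes and double_encode_hex are hypothetical, and the sketch assumes that iconv and od are installed and that the script itself is saved as UTF-8.

```bash
#!/bin/bash
# Hex dump of a string's bytes, without spaces or newlines.
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# Treat each input byte as a Latin-1 character and encode it as UTF-8 again.
double_encode_hex() {
  printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8 | od -An -v -tx1 | tr -d ' \n'
}

hex_bytes 'å'; echo          # bytes as stored in a UTF-8 source file
double_encode_hex 'å'; echo  # bytes after an unwanted second encoding
```

The first line printed is the two-byte UTF-8 form of the letter; the second is a four-byte sequence, which is what a doubly encoded file name would contain.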

Related

Bash. How to convert UTF-8 to hex encode?

I have a variable holding a string of UTF-8 text. I want to get a string like \xAA\xBB\xCC, or one encoded as \Uxxxxxxxx or some such. How can I do that?
I could do it with Python 3(.7):
def stou(x):
    s = ''
    for i in x:
        s = s + '\\U' + hex(ord(i))[2:]
    return s
But I'd like to solve it with native bash methods and/or standard, almost native Linux utilities such as base64 or find. I'm just trying to create a file server, and in the usual format I have problems with space characters, so I'm trying to find another way to store them.
Using perl:
$ echo -ne "12345 =\n= me + Дварфы" | perl -0777 -CS -nE 'say map { sprintf "\\U%x", $_ } unpack "U*"'
\U31\U32\U33\U34\U35\U20\U3d\Ua\U3d\U20\U6d\U65\U20\U2b\U20\U414\U432\U430\U440\U444\U44b
Basically, this reads all of its standard input as one UTF-8 encoded chunk, converts each codepoint to a number, and prints them out in base 16 with a leading \U before each one.
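For the \xAA\xBB\xCC byte form the question also mentions, a minimal sketch using only printf, od, tr and sed might look like this. The function name to_hex_escapes is hypothetical; od dumps the bytes as hex, tr removes the spacing, and sed prefixes every pair of hex digits with \x.

```bash
#!/bin/bash
# Print each byte of the argument as \xHH (hypothetical helper name).
to_hex_escapes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

encoded=$(to_hex_escapes 'a b')
printf '%s\n' "$encoded"
```

Because the space becomes \x20, the encoded form contains no whitespace. In bash, printf '%b' "$encoded" should turn the escapes back into the original bytes.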

Perl Unable to truncate string

I am trying to extract AAA and BBB from the output of the command "dspmq".
$dspmq <- this command gives output as -->
QMNAME(AAA) STATUS(Running)
QMNAME(BBB) STATUS(Running)
But it doesn't work with the below code.
perl -e 'use Data::Dumper qw(Dumper);my @qmgrlist = `dspmq`;$size = @qmgrlist;foreach my $i (@qmgrlist){my @temp1 = split /QMNAME\(/, $i;print @temp1;}'
AAA) STATUS(Running)
BBB) STATUS(Running)
I am able to truncate "QMNAME(" but unable to truncate those to the right of AAA and BBB. Basically I want to get the string between "QMNAME(" and the immediate ")". Please assist.
I think a regex approach is better than split() here, but you could use split() by splitting on parentheses and taking the second item in the returned list.
for (@qmgrlist) {
    say +(split /[()]/)[1];
}
And a brief note on your use of command-line options to run this code. You can make it simpler if you a) pipe the output of dspmq into your code and b) use -n to process a record at a time.
$ dspmq | perl -nE 'say +(split /[()]/)[1]'
There's also -M to load modules (e.g. -MData::Dumper), but you don't seem to be using Data::Dumper any more.
split isn't going to do what you need. I would just use a regular expression to match the substring you need.
So change the loop from this
foreach my $i (@qmgrlist)
{
    my @temp1 = split /QMNAME\(/, $i;
    print @temp1;
}
to this
foreach my $i (@qmgrlist)
{
    print "$1\n"
        if $i =~ /QMNAME\((.+?)\)/;
}
Try this perl one-liner:
dspmq | perl -lne 'print for m{ QMNAME [(] ( [^)]* ) [)] }x'
Here, dspmq STDOUT is fed using a pipe | into STDIN of the perl code, which has these flags:
-e tells Perl interpreter to look for the code inline rather than in a separate script file.
-n feeds the input line by line to the inline code (this way you do not need to store the output in an array - this matters for large outputs, not in your case).
-l strips the input record separator (newline on *NIX) before feeding it to the code, and appends it automatically after during print.
The print ... for ... m{... (...) ...} code prints every pattern captured in parentheses.
The captured pattern is [^)]*, which is maximum number (0 or more) chars that are not (^) listed in the character class, that is, that are not closing parens.
[(] ... [)] are literal parentheses escaped as character classes for readability. I prefer this to escaping like so: \( ... \).
QMNAME is used to make the programmer's intentions clear: you want the string that follows QMNAME in parens. I prefer this to using the field index, such as 1, which protects you against minor variation in output of your command used with different options, on different systems, etc.
Finally, the x regex modifier in m{...}x enables comments and whitespace to be ignored, and is preferred for readability.
RELATED:
Cutting the output of a dspmq command
The desired output can be achieved with the following code:
use strict;
use warnings;
use feature 'say';
map{ say $1 if /QMNAME\((.+?)\)/ } <DATA>;
__DATA__
QMNAME(AAA) STATUS(Running)
QMNAME(BBB) STATUS(Running)
output
AAA
BBB
and one liner (not tested - I am on Windows computer)
dspmq | perl -lne 'print $1 if /QMNAME\((.+?)\)/'
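The same extraction can also be sketched without Perl, using sed. The function name extract_qmnames is hypothetical, and the sample input is copied from the question so the snippet runs without MQ installed.

```bash
#!/bin/bash
# Print the text between "QMNAME(" and the next ")" for each matching line.
extract_qmnames() {
  sed -n 's/.*QMNAME(\([^)]*\)).*/\1/p'
}

# Sample input from the question; with MQ installed: dspmq | extract_qmnames
printf 'QMNAME(AAA) STATUS(Running)\nQMNAME(BBB) STATUS(Running)\n' | extract_qmnames
```

In a basic regular expression an unescaped ( is a literal parenthesis, while \( ... \) marks the captured group, so only the name itself is printed.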

Bash: remove substrings using ${string//substr/rep}

I am trying to write a small shell script that reads a text file (given as an argument), deletes all invalid Base64 characters, and then decodes the Base64 string into readable text.
For this example I can assume that I have a valid Base64 string polluted with additional invalid characters, so simply deleting them makes the string valid again.
I am having problems with the "remove all invalid chars" part.
Here is my Script:
#!/bin/bash
args=("$@")
#echo ${args[0]}
# read file
STRING="$(cat ${args[0]})"
echo "Input:"
echo $STRING
echo "\n"
#BASE64_REGEX='!/[^A-Za-z0-9+\/=]/'
STRING=${STRING//[!?_-]/}
echo "Fixed:"
echo $STRING
echo "\n"
# decode String
DECODED=$(base64 -d <<< "$STRING")
echo "Decoded:"
echo $DECODED
echo "\n"
I think my problem is this part here: STRING=${STRING//[!?_-]/}. After this operation the string contains ??___--- plus a line break, so I must be close somehow.
EDIT:
This would be the example string. I am trying to remove all characters which are NOT part of the Base64 alphabet.
!RGllIGVpbnppZ2VuIFNvbmRlc??nplaWNoZW4gaW0gQmFzZTY0IEFscGhhYmV0IHNpbmQgIisg_L_y_A9Ii4gQWxsZSB3ZWl0ZXJlbi-B-T-b25kZXJ6!ZWljaGVuICIhIsKnJCUiIGtvbW1!lbiBkb3J0IG5pY2h0IHZvci"4=
Thanks for your help!
It's because a ! in the first position of a character set inverts the set, like ^ (note: this is only true for pattern matching (globbing), not regex matching, but here it is just pattern matching). So [!?_-] matches every character except ?, _ and -, which is why only those survive.
Maybe you want
STRING=${STRING//[?\!_-]/}
Or why not use the set from your commented-out line:
STRING=${STRING//[^A-Za-z0-9+\/=]/}
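A minimal end-to-end sketch of the negated-set approach, under a few assumptions: the function name clean_base64 and the short sample string are made up for illustration, and base64 -d is the GNU coreutils spelling used in the question.

```bash
#!/bin/bash
# Remove every character that is not in the Base64 alphabet (hypothetical helper).
clean_base64() {
  local LC_ALL=C            # keep ranges such as A-Z limited to plain ASCII
  local s=$1
  printf '%s\n' "${s//[^A-Za-z0-9+\/=]/}"
}

polluted='!aGVs??bG8_-='    # "aGVsbG8=" (hello) with extra characters mixed in
cleaned=$(clean_base64 "$polluted")
printf 'Fixed:   %s\n' "$cleaned"
printf 'Decoded: %s\n' "$(printf '%s' "$cleaned" | base64 -d)"
```

Listing the allowed characters and negating the set avoids having to enumerate every possible invalid character.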

Convert seconds to hh:mm:ss format (or whatever format Excel or LibreOffice likes) and insert it back to a csv file in Bash

I have a csv file like this:
ELAPSEDTIME_SEC;CPU_%;RSS_KBYTES
0;3.4;420012
1;3.4;420012
2;3.4;420012
3;3.4;420012
4;3.4;420012
5;3.4;420012
And I'd like to convert the values (they are seconds) in the first column to hh:mm:ss format (or whatever Excel or LibreOffice can import as time format from csv) and insert it back to the file into a new column following the first. So the output would be something like this:
ELAPSEDTIME_SEC;ELAPSEDTIME_HH:MM:SS;CPU_%;RSS_KBYTES
0;0:00:00;3.4;420012
1;0:00:01;3.4;420012
2;0:00:02;3.4;420012
3;0:00:03;3.4;420012
4;0:00:04;3.4;420012
5;0:00:05;3.4;420012
And I'd have to do this in Bash to work under Linux and OS X as well.
I hope this is what you want:
TZ=UTC awk -F';' -vOFS=';' '
{
$1 = $1 ";" (NR==1 ? "ELAPSEDTIME_HH:MM:SS" : strftime("%H:%M:%S", $1))
}1' input.csv
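Since strftime is not part of POSIX awk, it may be missing from the awk shipped with OS X, and %H also wraps around after 24 hours. A sketch that relies only on integer arithmetic and sprintf is shown below; the function name add_hms_column is hypothetical.

```bash
#!/bin/bash
# Insert an h:mm:ss column after the first (seconds) column of a ;-separated file.
add_hms_column() {
  awk -F';' -v OFS=';' '
    NR == 1 { $1 = $1 OFS "ELAPSEDTIME_HH:MM:SS" }
    NR > 1  { s = $1
              $1 = s OFS sprintf("%d:%02d:%02d", int(s / 3600), int(s % 3600 / 60), s % 60) }
    { print }
  ' "$@"
}

printf 'ELAPSEDTIME_SEC;CPU_%%;RSS_KBYTES\n5;3.4;420012\n3661;3.4;420012\n' | add_hms_column
```

Called with a file name (add_hms_column input.csv) it reads that file; without arguments it reads standard input.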
By thinking about your question I found an interesting manipulation possibility: Insert a formula into the CSV, and how to pass it to ooCalc:
cat se.csv | while read line ; do n=$((n+1)) ; ((n>1)) && echo ${line/;/';"=time(0;0;$A$'$n')";'} ||echo ${line/;/;time of A;} ;done > se2.csv
formatted:
cat se.csv | while read line ; do
    n=$((n+1))
    ((n>1)) && echo ${line/;/';"=time(0;0;$A$'$n')";'} || echo ${line/;/;time of A;}
done > se2.csv
Remarks:
This adds a column - it doesn't replace
You have to set the import options for CSV correctly. In this case:
delimiter = semicolon (well, we had to do this for the original file as well)
text delimiter = " (wasn't the default)
deactivate checkbox "quoted field as text"
depending on your locale, the function name has to be translated. For example, in German I had to use "zeit(" instead of "time("
since formulas use semicolons themselves the approach will be simpler, not needing that much masking, if the delimiter is something else, maybe a tab.
In practice, you might treat the headline like all the other lines, and correct it manually in the end, but the audience of SO expects everything to work out of the box, so the command became something longer.
I would have preferred to replace the whole while read / cat/ loop thing with just a short sed '...' command, and I found a remark in the man page of sed, that = can be used for the rownum, but I don't know how to handle it.
Result:
cat se2.csv
ELAPSEDTIME_SEC;time of A;CPU_%;RSS_KBYTES
0;"=time(0;0;$A$2)";3.4;420012
1;"=time(0;0;$A$3)";3.4;420012
2;"=time(0;0;$A$4)";3.4;420012
3;"=time(0;0;$A$5)";3.4;420012
4;"=time(0;0;$A$6)";3.4;420012
5;"=time(0;0;$A$7)";3.4;420012
In this specific case, the awk-solution seems better, but I guess this approach might sometimes be useful to know.

Best way to check if argument is a filename or a file containing a list of filenames?

I'm writing a Perl script, and I'd like a way to have the user enter a file or a file containing a list of files in $ARGV[0].
The current way that I'm doing it is to check if the filename starts with an @; if it does, I treat that file as a list of filenames.
This is definitely not the ideal way to do it, because I've noticed that @ is a special character in bash (What does it do, by the way? I've only seen it used in $@ in bash).
You can specify an additional parameter on your command line to treat it differently, e.g.
perl script.pl file
for reading file's content, or
perl script.pl -l file
for reading list of files from file.
You can use a Getopt module such as Getopt::Long for easier parsing of input arguments.
First, you could use your shell to grab the list for you:
perl script.pl $( cat list )
If you don't want to do that, perhaps because you are running up against the maximum command line length, you could use the following before you use @ARGV or ARGV (including <>):
@ARGV = map {
    if (my ($qfn) = /^\@(.*)/s) {
        if (open(my $fh, '<', $qfn)) {
            chomp( my @args = <$fh> );
            @args
        } else {
            warn("Can't open $qfn: $!\n");
            ()
        }
    } else {
        $_
    }
} @ARGV;
Keep in mind that you'll have unintended side effects if you have a file whose name starts with "@".
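On the bash side of the question: "$@" expands to all positional parameters, each as its own word, and as far as I know a leading @ in an ordinary argument is passed through to the program unchanged. The same @FILE convention can be sketched in bash as follows; the function name expand_args and the file names are hypothetical.

```bash
#!/bin/bash
# Print one argument per line, replacing each @FILE argument by the lines of FILE.
expand_args() {
  local arg line
  for arg in "$@"; do
    if [[ $arg == @* ]]; then
      # "${arg#@}" strips the leading @ to get the list file's name.
      while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "$line"
      done < "${arg#@}"
    else
      printf '%s\n' "$arg"
    fi
  done
}

list=$(mktemp)
printf 'b.txt\nc d.txt\n' > "$list"
expand_args a.txt "@$list"
rm -f "$list"
```

Reading the list line by line keeps file names that contain spaces intact, which an unquoted $( cat list ) would split.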
'@' is special in Perl, so you need to escape it in interpolating strings and patterns, unless you use one of the non-interpolating string types, 'a non-interpolating $string' or q(another non-interpolating $string). In a regex, escape it like so:
if ( $arg =~ /^\@/ ) {
    ...
}
Interpolating delimiters are any of the following:
"..." or qq/.../
`...` or qx/.../
/.../ or qr/.../
For all those, you will have to escape any literal @.
Otherwise, a filename starting with an @ has pretty good precedent in command line arguments.

Resources