How to slip gather-take in a lazy manner into map?

I need to construct the following flow:
Accept list of file names
Extract multiple lines from those files
Process those lines
However I have no idea how to properly inject gather-take into map:
sub MAIN ( *@file-names ) {
    @file-names.map( { slip parse-file( $_ ) } ).map( { process-line( $_ ) } );
}

sub parse-file ( $file-name ) {
    return gather for $file-name.IO.lines -> $line {
        take $line if $line ~~ /a/; # dummy example logic
    }
}

sub process-line ( $line ) {
    say $line; # dummy example logic
}
This code works but leaks memory like crazy. I assume slip makes gather-take eager? Or slip does not mark Seq items as consumed? Is there a way to slip the gather-take result into map in a lazy manner?
BTW: My intent is to parallelize each step with race later - so for example I have 2 files parsed at the same time producing lines for 10 line processors. Generally speaking, I'm trying to figure out the easiest way of composing such cascade flows. I've tried Channels to connect each processing step, but they have no built-in pushback. If you have any other patterns for such flows then comments are more than welcome.
EDIT 1:
I think my code is correct, and the memory leak is not caused by bad logic but rather by a bug in the Slip class. I've created issue https://github.com/rakudo/rakudo/issues/5138, which is currently open. I'll post an update once it is resolved.
EDIT 2:
No, my code was not correct :) See my post below for the answer.

I believe that you are mistaken about the cause of the non-laziness in your code – in general, using slip should not typically make code eager. And, indeed, when I run the slightly modified version of your code shown below:
sub MAIN () {
    my @file-names = "tmp-file000".."tmp-file009";
    spurt $_, ('a'..'z').join("\n") for @file-names;
    my $parsed = @file-names.map( { slip parse-file( $_ ) } );
    say "Reached line $?LINE";
    $parsed.map( { process-line( $_ ) } );
}
sub parse-file ( $file-name ) {
    say "processing $file-name...";
    gather for $file-name.IO.lines -> $line {
        take $line if $line ~~ /a/; # dummy example logic
    }
}

sub process-line ( $line ) {
    say $line; # dummy example logic
}
I get output showing that Raku processes the files lazily (note that it does not call parse-file until it needs to pass new values to process-line):
Reached line 8
processing tmp-file000...
a
processing tmp-file001...
a
processing tmp-file002...
a
processing tmp-file003...
a
processing tmp-file004...
a
processing tmp-file005...
a
processing tmp-file006...
a
processing tmp-file007...
a
processing tmp-file008...
a
processing tmp-file009...
a
Since I don't have the rest of your code, I'm not sure what is triggering the non-lazy behavior you're observing. In general, if you have code that is being eagerly evaluated when you want it to be lazy, though, the .lazy method and/or the lazy statement prefixes are good tools.
Finally, a couple of minor notes about the code you posted that aren't relevant to your question but that might be helpful:
All Raku functions return their final expression, so the return statement in parse-file isn't necessary (and it's actually slightly slower/non-idiomatic).
A big part of the power of gather/take is that they can cross function boundaries. That is, you can have a parse-file function that takes different lines without needing the gather statement inside parse-file – you just need to call parse-file within the scope of a gather block. This feels like it might be helpful in solving the problem you're working on, though it's hard to be sure without more info.

First of all - I had a big misconception. I thought that all lines produced by parse-file must be slipped into the map block like this:
    @file-names.map( produce all lines here ).map( process all lines here );
And a Slip is a List that keeps track of all its elements. That is why I had a big memory leak.
The solution is to create the gather-take sequence inside map but consume it outside map:
    @file-names.map( { parse-file( $_ ) } ).flat.map( { process-line( $_ ) } );
So now it is:
    @file-names.map( construct sequence here ).(get items from sequence here).map( process all lines here );

Related

Perl: Multidimensional arrays and "experimental push" error

I'm a junior Perl programmer and have very little experience with multidimensional arrays. I'm not even sure it is the proper data structure for this project.
I have an array of clients that is read in from a file:
my @clientlist = grep(/[A-Z]\w+$/, readdir(DIR));
It produces a list like:
$VAR1 = [
    'AA14A',
    'BB12R',
    'CC34M'
];
Each client has some unknown number of elements read from another file that correspond to the client name like:
__U_AA14A_0001, __U_AA14A_0002, __U_AA14A_0003
__U_BB12R_0001, __U_BB12R_0002, __U_BB12R_0003
When I try to assign the corresponding element to the client name:
my @allclients;
my $header = $string;
my $i = 0; # index in array
foreach my $client (@clientlist) {
    push @allclients{$client}{$i}, $header;
    $i += 1;
}
it prints:
Useless use of push with no values at ./convert_domains.pl line 97.
Global symbol "%allclients" requires explicit package name (did you forget to declare
"my %allclients"?) at ./convert_domains.pl line 97.
Experimental push on scalar is now forbidden at ./convert_domains.pl line 97, near "}
{"
syntax error at ./convert_domains.pl line 97, near "}{"
I've also tried numerous variations to the push() function, but they all return some variation of the above.
I'm trying to build something like:
AA14A, __U_AA14A_0001, __U_AA14A_0002, __U_AA14A_0003
BB12R, __U_BB12R_0001, __U_BB12R_0002, __U_BB12R_0003
so I can iterate through it and print out the individual elements.
My main question is how to properly access a multidimensional array of strings. I've also read this perldoc, but it doesn't seem to work:
https://perldoc.perl.org/perllol
First of all,
my @allclients;
should be
my %allclients;
because you want an associative array (i.e. an array-like structure keyed by strings) and thus a hash.
Also,
push @allclients{$client}{$i}, ...;
should be
$allclients{$client}[$i] = ...;
or
push @{ $allclients{$client} }, ...;
or
push $allclients{$client}->@*, ...;
You want to add to the array referenced by $allclients{$client}, so @{ $allclients{$client} } or $allclients{$client}->@*. See Perl Dereferencing Syntax.
Yes, you never explicitly created any of the multiple arrays and the references to them, but that's not a problem thanks to autovivification.
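Putting the corrected sigils and autovivification together, a minimal self-contained sketch (the client names and element strings are made up to mirror the question's data) could look like this:

```perl
use strict;
use warnings;

my @clientlist = ('AA14A', 'BB12R');
my %allclients;    # hash of array references, keyed by client name

for my $client (@clientlist) {
    # autovivification creates $allclients{$client} as an array
    # reference on the first push
    push @{ $allclients{$client} }, sprintf('__U_%s_%04d', $client, $_) for 1 .. 3;
}

# print each client followed by its elements, one client per line
for my $client (sort keys %allclients) {
    print join(', ', $client, @{ $allclients{$client} }), "\n";
}
```

This prints lines in the shape the question asks for, e.g. `AA14A, __U_AA14A_0001, __U_AA14A_0002, __U_AA14A_0003`.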

What's the proper way to allow users to provide a string "mangler" as a regex/proc/expr/

In my Tcl/Tk project, I need to allow my users to mangle a string in a well-defined way.
The idea is to allow people to declare a "string mangling" proc/expr/function/... in a configuration file, which then gets applied to the strings in question.
I'm a bit worried about how to properly implement that.
Possibilities I have considered so far:
regular expressions
That was my first thought, but there's two caveats:
search/replace with regular expressions in Tcl seems to be awkward. At least with regsub I need to pass the match and replacement parts separately (as opposed to how e.g. sed allows me to pass a single complicated string that does everything for me); there are sed implementations for Tcl, but they look naive and might break sooner rather than later
also, regexes can be awkward by themselves; using them to mangle complicated strings is often more complicated than it should be
procs?
Since the target platform is Tcl anyhow, why not use the power of Tcl to do string mangling?
The "function" should have a single input and produce a single output; ideally, the user should be nudged into doing it right (e.g. not being able to define a proc that requires two arguments), and it should be (nigh) impossible to create side effects (like changing the state of the application).
A simplistic approach would be to use proc mymangler s $body (with $body being the string defined by the user), but there are so many things that can go wrong:
$body assuming a different arg-name (e.g. $x instead of $s)
$body not returning anything
$body changing variables,... in the environment
Expressions look more like it (always returning things, not allowing the environment to be modified easily), but I cannot make them work on strings, and there's no way to pass a variable without agreeing on its name.
So, the best I've come up with so far is:
set userfun {return $s} ;# user-defined string
proc mymangler s ${userfun}
set output [mymangler $input]
Are there better ways to achieve user-defined string-manglers in Tcl?
You can use apply -- the user provides a 2-element list: the second element is the "proc body", the code that does the mangling; the first element is the variable name to hold the string, this variable is used in the body.
For example:
set userfun {{str} {string reverse $str}}
set input "some string"
set result [apply $userfun $input] ;# => "gnirts emos"
Of course the code you get from the user is any arbitrary Tcl code. You can run it in a safe interpreter:
set userfun {{str} {exec some malicious code; return [string reverse $str]}}
try {
set interp [safe::interpCreate]
set result [$interp eval [list apply $userfun $input]]
puts "mangled string is: $result"
safe::interpDelete $interp
} on error e {
error "Error: $e"
}
results in
Error: invalid command name "exec"
Notes:
a standard Tcl command is used, apply
the user must specify the variable name used in the body.
this scheme does protect the environment:
set userfun {{str} {set ::env(SOME_VAR) "safe slave"; return $str$str}}
set env(SOME_VAR) "main"
puts $env(SOME_VAR)
try {
set interp [safe::interpCreate]
set result [$interp eval [list apply $userfun $input]]
puts "mangled string is: $result"
safe::interpDelete $interp
} on error e {
error "Error: $e"
}
puts $env(SOME_VAR)
outputs
main
mangled string is: some stringsome string
main
if the user does not return a value, then the mangled string is simply the empty string.
The "simplistic" approach is like foreach in that it requires the user to supply a variable name and a script to evaluate that uses that variable, and is a good approach. If you don't want it affecting the rest of the program, run it in a separate interpreter:
set x 0
proc mymangler {name body} {
set i [interp create -safe]
set s "some string to change"
try {
# Build the lambda used by apply here instead of making
# the user do it.
$i eval [list apply [list $name $body] $s]
} on error e {
return $e
} finally {
interp delete $i
}
}
puts [mymangler s { set x 1; string toupper $s }]
puts $x
outputs
SOME STRING TO CHANGE
0
If the person calling this says to use s as a variable and then uses something else in the body, it's on them. Same with providing a script that doesn't return anything.
I'd generally allow the user to specify a command prefix as a Tcl list (most simple command names are trivially suitable for this), which you would then apply to the argument by doing:
set mangled [{*}$commandPrefix $valueToMangle]
This lets people provide pretty much anything they want, especially as they can use apply and a lambda term to mangle things as required. Of course, if you're in a procedure then you're probably actually better off doing:
set mangled [uplevel 1 [list {*}$commandPrefix $valueToMangle]]
so that you're running in the caller's context (change 1 to #0 to use the global context instead) which can help protect your procedure against accidental changes and make using upvar within the mangler easier.
If the source of the mangling prefix is untrusted (what that means depends greatly on your application and deployment) then you can run the mangling code in a separate interpreter:
# Make the safe evaluation context; this is *expensive*
set context [interp create -safe]
# You might want to let them define extra procedures too
# interp invokehidden $context source /the/users/file.tcl
# Use the context
try {
set mangled [interp eval $context [list {*}$commandPrefix $valueToMangle]]
} on error {msg} {
# User supplied something bad; error message in $msg
}
There's various ways to support users specifying the transformation, but if you can expose the fact that you're working with Tcl to them then that's probably easiest and most flexible.

How to check if the first variable passed into a method is a string (Perl)

I have no idea how to check for this. My method (an if condition in the method) should only work (execute) if the first argument passed in is a string. I know how to check other types, but I can't seem to find anything for checking for a string.
For a hash I would do something like:
    if (ref($arg) eq 'HASH') { ... }
If someone could provide a simple example I'm sure I would be able to apply it to what I'm doing. I will put up the code for the method and an explanation for the whole operational details of the method if needed.
Added Information
This is a method for handling different types of errors in the software, here are the 3 possible input formats:
$class->new("error string message")
$class->new("error string message", code => "UNABLE_TO_PING_SWITCH_ERROR")
$class->new("error string message", code => "UNABLE_TO_PING_SWITCH_ERROR", switch_ip => $ip3, timeout => $timeout)
There will always be an error message string first.
With the 1st case there is also a hashref to an error hash structure that is located in a library;
this method new will go into template processing if the word "code" exists as an arg, where the longer detailed error message is constructed (I already have the logic for this).
But I have to add logic so that the error message string is added to the hash, so the output is one hash and not strings.
The second case is very similar to the first, where there are parameters, e.g. switch_ip, which are inserted into the string using a similar template processing logic (I already have this too).
So I think the first and second cases can be handled in the same way, but I'm not sure, so separated them in this question.
The last case is just an error message string by itself, which at the minute I just insert into a one-key message hash: { message => "error string" }.
So after all that, how should I be checking or dividing up these error cases? At the minute my idea for the ones with code is to dump the arguments into a hash and just use something like:
    if (exists $param{code}) { doTemplateProcess(); ... }
I need to ensure that there is a string passed in first, though, which was my original question. Does any of my context information help? I hope I didn't go off the topic of my question; if so I'll open a new question. Thanks.
Error hash - located in Type.pm
use constant ERROR_CODE => {
    UNABLE_TO_PING_SWITCH_ERROR => {
        category  => 'Connection Error:',
        template  => 'Could not ping switch %s in %s minutes',
        tt        => { template => 'disabled' },
        fatal     => 1,
        wiki_page => 'www.error-solution.com/',
    },
};
From comments:
These will be called in the software's code like so
ASC::Builder::Error->new(
    "Phase x this occurred because y was happening:",
    code      => UNABLE_TO_PING_SWITCH_ERROR,
    switch_ip => $ip3,
    timeout   => 30,
);
Putting the wisdom of your particular problem aside and channeling Jeff Foxworthy:
If you have a scalar and it's not a reference, you might have a string.
If your non-reference scalar doesn't look like a number, it might be a string.
If your non-reference scalar looks like a number, it can still be a string.
If your non-reference scalar has a different string and number value, it might be a dualvar.
You know that your argument list is just that: a list. A list is a collection of scalar values. A scalar can be a reference or not a reference. I think you're looking for the not a reference case:
die "You can't do that" if ref $first_argument;
Past that, you'd have to do fancier things to determine if it's the sort of value that you want. This might also mean that you reject objects that pretend to be strings through overloading and whatnot.
Perhaps you can make the first argument part of the key-value pairs that you pass. You can then access that key to extract the value and delete it before you use the remaining pairs.
You may easily check only whether the error string is a simple scalar value or a reference. You would do that with ref, but you must consider what you want to do if the first parameter isn't a string
You should write your constructor in the ASC::Builder::Error package along these lines
sub new {
    my $class = shift;
    my ($error, %options) = @_;
    die if ref $error;
    bless { string => $error }, $class;
}
This example simply dies, and so kills the program, if it is called with anything other than a simple string or number as the first parameter
You may call it as
ASC::Builder::Error->new('error')
or
ASC::Builder::Error->new(42)
and all will be well. If you try
ASC::Builder::Error->new('message', 'code')
then you will see a warning
Odd number of elements in hash assignment
And you may make that warning fatal
If there is anything more then you should explain
Supporting all of the following is simple:
$class->new("s")
$class->new("s", code => "s")
$class->new("s", code => "s", switch_ip => "s", timeout => "s")
All you need is the following:
sub new {
    my ($class, $msg, %opts) = @_;
    ...
}
You can use checks such as the following to examine what the caller provided:
if (exists($opts{code}))
if (defined($opts{code}))
if ($opts{code})
Despite saying that the string will always be provided, you now ask how to check if was provided. As such, you are probably trying to perform validation rather than polymorphism. You shouldn't waste your time doing this.
Let's look at the hash reference example you gave. ref($arg) eq 'HASH' is wrong. That returns false for some hash references, and it returns false for some things that act like a reference to a hash. The following is a more proper check:
eval { %$arg; 1 }
The equivalent for strings would be the following:
eval { "$arg"; 1 }
Unfortunately, it will always return true! Every value can act as a string. That means the best thing you can do is simply to check if any argument is provided.
use Carp qw( croak );
croak("usage") if !@_;
It's rare for Perl subs to perform argument validation. Not only is it tricky, it's also expensive. It also provides very little benefits. Bad or missing arguments usually results in exceptions or warnings shortly after.
You might see suggestions to use croak("usage") if ref($arg); (or worse, die if ref($arg);), but keep in mind that those will cause the rejection of perfectly fine objects that overload stringification (which is somewhat common), and they will fail to detect the problem with ASC::Builder::Error->new(code => ...) because code produces a string. Again, performing type-based argument validation is an expensive and buggy practice in Perl.
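Drawing the answers together, a minimal sketch of such a constructor is below. The package layout and the croak message are assumptions for illustration; the call matches the forms shown in the question:

```perl
use strict;
use warnings;

package ASC::Builder::Error;
use Carp qw( croak );

# The first argument is the message string; the rest are key/value
# options such as code, switch_ip, timeout.
sub new {
    my ($class, $msg, %opts) = @_;
    croak('usage: ->new($message, %options)') if !defined $msg || ref $msg;
    return bless { message => $msg, %opts }, $class;
}

package main;

my $err = ASC::Builder::Error->new(
    'Phase x this occurred because y was happening:',
    code    => 'UNABLE_TO_PING_SWITCH_ERROR',
    timeout => 30,
);
print $err->{code}, "\n";    # options are merged into the object hash
```

Note the `ref $msg` check carries the caveat discussed above: it rejects objects that overload stringification, so drop it if you need to accept those.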

File handling (.conf file) in perl

I need to read this .conf file. This file cannot be read by modules such as Config::Tiny, Config::INI, Config::Simple, etc.
Here is the .conf file, let's say the file name is conference.conf :
[ConferenceId] #section
1000 #value
2000
3000
4000
[RadioExt]
1000=102 #parameter and value
2000=202
3000=302
4000=402
What I want is for Perl to read only the values, not the sections, and print them out to the user. I'm still new to Perl; I've only been learning it for a week. This type of task makes it hard for me to make any progress in reading, writing, and appending to configuration files.
What I also want is for the values in [ConferenceId] to be declared as global. When the values in [ConferenceId] change, the parameters in [RadioExt] should change too. For example,
[ConferenceId]
1100 #the values has been changed
2100
3100
4100
[RadioExt]
1100=102 #parameters also changed
2100=202
3100=302
4100=402
Can anybody help me with this? I know it is such a big favor, but I really need this so that I can learn more about reading, writing, and appending to configuration files. Thanks.
The real answer to this is to use Config::Tiny.
However, since this is a learning exercise assigned by your teacher, I will point you at perlfaq5: How do I change, delete, or insert a line in a file, or append to the beginning of a file?. That should demonstrate all of the standard ways to manipulate a file.
If it wasn't also a module, I'd recommend using the core library Tie::File for this problem, but that's probably not your teacher's intent.
So my final recommendation is to take a look at the source for Config::Tiny. It's likely to be outside of your skill set, but ideally you would be able to read this entire file by the end of your course. And this problem does not take a complicated bit of code. Looking at how others have solved problems can be one of the best ways to learn, especially if you're able to recognize which are the better modules.
Update
Config::Tiny alone will not be able to parse your file, because it's not a strict INI file format. The fact that one of your sections has raw values without keys won't work with any of the standard modules.
Below is an example of how to parse your file using regular expressions. It should probably be enhanced with additional error checking to make sure key/value pairs aren't mixed with array values, but this should get you started:
use strict;
use warnings;

my %hash;
my $section;

while (<DATA>) {
    chomp;
    next if /^\s*$/;

    # Begin Section
    if (/^\s*\[(.*)\]\s*$/) {
        $section = $1;

    # Hash Key & Value
    } elsif (/^(.*?)=(.*)/) {
        $hash{$section}{$1} = $2;

    # Array
    } else {
        push @{ $hash{$section} }, $_;
    }
}

use Data::Dump;
dd \%hash;
__DATA__
[ConferenceId]
1000
2000
3000
4000
[RadioExt]
1000=102
2000=202
3000=302
4000=402
Outputs:
{
    ConferenceId => [1000, 2000, 3000, 4000],
    RadioExt     => { 1000 => 102, 2000 => 202, 3000 => 302, 4000 => 402 },
}
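With %hash parsed as shown, printing only the values and none of the section names (the question's stated goal) is one more loop. A sketch, seeded with the same structure so it is self-contained:

```perl
use strict;
use warnings;

# The structure produced by the parser above: an array ref for the
# bare-value section, a hash ref for the key=value section.
my %hash = (
    ConferenceId => [1000, 2000, 3000, 4000],
    RadioExt     => { 1000 => 102, 2000 => 202, 3000 => 302, 4000 => 402 },
);

my @values;
for my $section (sort keys %hash) {
    my $data = $hash{$section};
    if (ref $data eq 'ARRAY') {
        push @values, @$data;          # bare values like 1000, 2000, ...
    }
    else {
        push @values, values %$data;   # the 102, 202, ... side of key=value
    }
}
print "$_\n" for sort { $a <=> $b } @values;
```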
Even though the file extension is .conf you should be able to read it like any other text file.
You can try this:
my $file = "<your filename here>";
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (my $line = <$fh>) {
    # here you can write your logic using $line
}
close($fh);

Monitoring Progress of Cases[] on a Very Large Body of Information

I'm currently undertaking operations on a very large body of text (~290MB of plain text in one file). After importing it into Mathematica 8, I'm currently beginning operations to break it down into lowercase words, etc. so I can begin textual analysis.
The problem is that these processes take a long time. Would there be a way to monitor these operations through Mathematica? For operations with a variable, I've used ProgressIndicator etc., but this is different. My searching of the documentation and StackOverflow has not turned up anything similar.
In the following, I would like to monitor the progress of the Cases[ ] command:
input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];
Something like StringCases[ToLowerCase[input], WordCharacter..] seems to be a little faster. And I would probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:
input = ExampleData[{"Text", "PrideAndPrejudice"}];
wordList =
Module[{charCount = 0, wordCount = 0, allWords}
, PrintTemporary[
Row[
{ "Characters: "
, ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
}]]
; allWords = StringSplit[
ToLowerCase[input]
, (_ /; (++charCount; False)) | Except[WordCharacter]
]
; PrintTemporary[
Row[
{ "Words: "
, ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
}]]
; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]
]
The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails -- but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then monitor the iterator using Monitor to see the progress. For example, if your text consists of lines of text terminated by a newline, you could do something like this
Module[{list, t = 0},
list = ReadList["/users/USER/alltext.txt", "String"];
Monitor[wordlist =
Flatten@Table[
StringCases[ToLowerCase[list[[t]]], WordCharacter ..],
{t, Length[list]}],
Labeled[ProgressIndicator[t/Length[list]], N@t/Length[list], Right]];
Print["Ready"]]
On a file of about 3 MB this took only marginally more time than Joshua's suggestion.
I don't know how Cases works, but List processing can be time-consuming, especially if it is building the List as it goes. Since there is an unknown number of terms in the processed expression, that is likely what is occurring with Cases. So I'd try something slightly different: replacing "" with Sequence[]. For instance, this List
{"5", "6", "7", Sequence[]}
becomes
{"5", "6", "7"}.
So, try
bigList /. "" -> Sequence[]
it should operate faster as it is not building up a large List from nothing.
