Writing a persistent perl script

I am trying to write a persistent/cached script. The code would look something like this:
...
memoize('process_file');
print process_file($ARGV[0]);
...
sub process_file{
my $filename = shift;
my ($a, $b, $c) = extract_values_from_file($filename);
if (exists $my_hash{$a}{$b}{$c}){
return $my_hash{$a}{$b}{$c};
}
return $default;
}
Which would be called from a shell script in a loop as follows
value=`perl my_script.pl`;
Is there a way I could call this script so that it keeps its state from call to call? Let's assume that both initializing '%my_hash' and calling extract_values_from_file are expensive operations.
Thanks

This is kind of dark magic, but you can store state after your script's __DATA__ token and persist it.
use Data::Dumper; # or JSON, YAML, or any other data serializer
package MyPackage;
my $DATA_ptr;
our $state;
INIT {
$DATA_ptr = tell DATA;
$state = eval join "", <DATA>;
}
...
manipulate $MyPackage::state in this and other scripts
...
END {
open DATA, '+<', $0; # $0 is the name of this script
seek DATA, $DATA_ptr, 0;
print DATA Data::Dumper::Dumper($state);
truncate DATA, tell DATA; # in case new data is shorter than old data
close DATA;
}
__DATA__
$VAR1 = {
'foo' => 123,
'bar' => 42,
...
}
In the INIT block, store the position of the beginning of your file's __DATA__ section and deserialize your state. In the END block, you reserialize the current state and overwrite the __DATA__ section of your script. Of course, the user running the script needs to have write permission on the script.
Edited to use INIT block instead of BEGIN block -- the DATA block is not set up during the compile phase.

If %my_hash in your example has a moderate size in its final initialized state, you can simply use one of the serialization modules, such as Storable, JSON::XS or Data::Dumper, to keep your data in pre-assembled form between runs. Generate a new file when it is absent, and just reload the ready content from it when it is present.
Also, you mentioned that you would call this script in a loop. A good strategy would be not to call the script inside the loop, but to build a queue of arguments instead and then pass all of them to the script in a single execution after the loop. The script would set up its environment once and then loop over the arguments, doing its cheap work without redoing the setup steps for each of them.

You can't get the script to keep state. As soon as the process exits, any information not written to disk is gone.
There are a few ways you can accomplish this though:
Write a daemon which listens on a network or unix socket. The daemon can populate my_hash and answer questions sent from a very simple my_script.pl. It'd only have to open a connection to the daemon, send the question and return an answer.
Create an efficient look-up file format. If you need the information often it'll probably stay in the VFS cache anyway.
Set up a shared memory region. The first time your scripts starts you save the information there, then re-use it later. That might be tricky from a Perl script though.

No, not directly, but it can be achieved in many ways.
1) I understand **extract_values_from_file()** parses the given file and returns a hash.
2) Step 1 can be made into a separate script that dumps the parsed hash into a file using **Data::Dumper**.
3) When running my_script.pl, ensure that the file generated by step 2 is newer than the config file. You can achieve this via **make**.
3.1) **use** the file generated by step 2 to retrieve the values.
The same can be achieved via freeze/thaw.

Related

What is the most efficient way to keep writing a frequently changing JavaScript object to a file in NodeJS?

I have a JavaScript object with many different properties, and it might look something like this:
var myObj = {
prop1: "val1",
prop2: [...],
...
}
The values in this object keep updating very frequently (several times every second) and there could be thousands of them. New values could be added, existing ones could be changed or removed.
I want to have a file that always has the updated version of this object. The simple approach for doing this would be just writing the entire object to the file all over again after each time that it changes like so:
fs.writeFileSync("file.json", JSON.stringify(myObj));
This doesn't seem very efficient for big objects that need to be written very frequently. Is there a better way of doing this?

You should use a database. Something simple like sqlite3 would be a good option. Have a table with just two columns 'Key' 'Value' and use it as a key value store. You will gain advantages like transactions and better performance than a file as well as simplifying your access.

Maintaining a file (on the filesystem) containing the current state of a rapidly changing object is surprisingly difficult. Specifically, setting things up so some other program can read the file at any time is the hard part. Why? At any time the file may be in the process of being written, so the reader can get inconsistent results.
Here's an outline of a good way to do this.
1) Write the file less often than each time the state changes. Whenever the state changes, call updateFile(myObj). It sets a timer for, let's say, 500ms, then writes the very latest state to the file when the timer expires. Something like this (not debugged):
let latestObj
let updateFileTimer = 0
function updateFile (myObj) {
latestObj = myObj
if (updateFileTimer === 0) {
updateFileTimer = setTimeout (
function () {
/* write latestObj to the file */
updateFileTimer = 0
}, 500)
}
}
This writes the latest state of your object to the file, but no more than every 500ms.
2) Inside that timeout function, write out a temporary file. When it's written, delete the existing file and rename the temp file to have the existing file's name. Do all this asynchronously so the rest of your program won't have to wait for the filesystem to work. Your timeout function will look like this:
updateFileTimer = setTimeout (
  function () {
    /* write latestObj (not the stale myObj argument) to a temp file */
    fs.writeFile("file.json.tmp",
      JSON.stringify(latestObj),
      function (err) {
        if (err) throw err;
        fs.unlink("file.json",
          function (err) {
            /* ENOENT only means there was no previous file to delete */
            if (!err || err.code === "ENOENT")
              fs.renameSync("file.json.tmp", "file.json")
          })
      })
    updateFileTimer = 0
  }, 500)
There's one more thing to worry about. There's a brief period of time between the unlink and the renameSync operation where the "file.json" file does not exist in the file system. So, any program you write that READs "file.json" needs to try again if the file isn't found.
If you use Linux, macOS, FreeBSD, or another UNIX-derived operating system for this code, it will work well. Those operating systems' file systems allow one program to unlink a file while another program is reading it. If you're running it on a DOS-derived operating system like Windows, the unlink operation will fail when another program is reading the file.
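The two steps above can be combined into one small helper. This is only a sketch, with one assumed simplification: on POSIX systems rename replaces an existing destination in a single step, so the unlink is skipped and a reader should never find "file.json" missing. The names createFileUpdater, targetPath, delayMs and onWritten are hypothetical, not part of the original answer.

```javascript
const fs = require("fs");

// Hypothetical helper: returns update(obj), which writes the latest object to
// targetPath no more often than once per delayMs.
function createFileUpdater(targetPath, delayMs, onWritten = () => {}) {
  let latestObj;
  let scheduled = false; // true from scheduling until the write has finished
  let dirty = false;     // an update arrived after a write was scheduled

  function schedule() {
    scheduled = true;
    setTimeout(() => {
      dirty = false; // everything seen so far is covered by this write
      const tmpPath = targetPath + ".tmp";
      // Write the full snapshot to a temp file first...
      fs.writeFile(tmpPath, JSON.stringify(latestObj), (err) => {
        if (err) return finish(err);
        // ...then swap it into place (assumption: POSIX rename semantics).
        fs.rename(tmpPath, targetPath, finish);
      });
    }, delayMs);
  }

  function finish(err) {
    scheduled = false;
    onWritten(err || null);
    if (dirty) schedule(); // state changed during the write: write again
  }

  return function update(obj) {
    latestObj = obj;
    if (scheduled) {
      dirty = true;
    } else {
      schedule();
    }
  };
}
```

Usage would be something like const updateFile = createFileUpdater("file.json", 500), followed by updateFile(myObj) whenever the state changes. Waiting for one write to finish before scheduling the next avoids two writes sharing the same temp file.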

How to statically analyse that a file is fit for importing?

I have CLI program that can be executed with a list of files that describe instructions, e.g.
node ./my-program.js ./instruction-1.js ./instruction-2.js ./instruction-3.js
This is how I am importing and validating that the target file is an instruction file:
const requireInstruction = (instructionFilePath) => {
const instruction = require(instructionFilePath)
if (!instruction.getInstruction) {
throw new Error('Not instruction file.');
}
return instruction;
};
The problem with this approach is that the file is executed regardless of whether it matches the expected signature, i.e. if the file contains a side action such as connecting to the database:
const mysql = require('mysql');
mysql.createConnection(..);
module.exports = mysql;
The 'Not instruction file.' error will fire and I will ignore the file, but the side action will remain running in the background.
How to safely validate target file signature?
Worst case scenario, is there a conventional way to completely sandbox the require logic and kill the process if file is determined to be unsafe?

Worst case scenario, is there a conventional way to completely sandbox the require logic and kill the process if file is determined to be unsafe?
Move the check logic into a specific js file. Make it call process.exit(0) when everything is fine and process.exit(1) when it's wrong.
In your current program, instead of loading the file via require, use child_process.exec to invoke your new file, giving it the required parameter to know which file to test.
In your updated program, bind the close event to know if the return code was 0 or 1.
If you need more information than 0 or 1, have the new js file that loads the instruction print some JSON.stringify-ed data to stdout (console.log), then retrieve it and JSON.parse it in the callback of the call to child_process.exec.
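A minimal sketch of the child-process check described above. It swaps in child_process.execFile instead of exec (to avoid shell quoting of the path) and, to stay self-contained, passes the checker source inline via node -e rather than keeping it in a separate file. The names CHECKER_SOURCE and isInstructionFile are hypothetical, as is the 5 second timeout.

```javascript
const { execFile } = require("child_process");
const path = require("path");

// Checker that runs inside the child process. It loads the candidate file and
// exits 0 if the module exposes getInstruction, 1 otherwise. process.exit()
// also ends any side effects the file started (connections, timers).
const CHECKER_SOURCE = `
  let ok = false;
  try {
    const mod = require(process.argv[1]);
    ok = Boolean(mod && mod.getInstruction);
  } catch (e) {
    ok = false;
  }
  process.exit(ok ? 0 : 1);
`;

// Hypothetical helper: calls back with true when the file looks like an
// instruction file, false otherwise (including crashes and timeouts).
function isInstructionFile(instructionFilePath, callback) {
  const absolutePath = path.resolve(instructionFilePath);
  execFile(
    process.execPath,                         // the current node binary
    ["-e", CHECKER_SOURCE, absolutePath],
    { timeout: 5000 },                        // assumed limit for a hung file
    (err) => callback(err === null)           // err is null on exit code 0
  );
}
```

Only files that pass this check would then be loaded with require in the main program. This guards against accidents, not against malicious files, since the child still executes the code.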
Alternatively, have you looked into AST processing?
http://jointjs.com/demos/javascript-ast
It could help you identify pieces of code that are not embedded within an exported function.

(Note: I discussed this question with the author on IRC. There may be some context in my answer that isn't in the original question.)
Given that your scenario is purely about protecting against accidental inclusion of non-instruction files, rather than about preventing malicious behaviour, static analysis using something like Esprima will probably be sufficient.
One approach would be to require that every instruction file exports some kind of object with a name property, containing the name of the instruction file. As there's not really anything to put in there besides a string literal, you can be fairly certain that if you can't locate a name property through static analysis, the file is not an instruction file - even in a language like JavaScript that isn't fully statically analyzable.
For any readers of this thread that are trying to protect from malicious actors, rather than accidents - for example, when accepting untrusted code from users: you cannot sandbox or 'validate' JavaScript with Node.js alone (not with the vm module either), and the above solution will not work for you. You will need system-level containerization or virtualization to run this kind of code safely. There are no other options.

node.js multithreading with max child count

I need to write a script that takes an array of values and, in a multithreaded way (forks?), runs another script with each value from the array as a parameter. There should be a maximum number of running forks, so that it waits for a script to finish if more than n are already running. How do I do that?
There is a module named child_process, but I'm not sure how to get it done with that, as it always waits for the child to terminate.
Basically, in PHP it would be something like this (wrote it from head, may contain some syntax errors):
<?php
declare(ticks = 1);
$data = file('data.txt');
$max=20;
$child=0;
function sig_handler($signo) {
global $child;
switch ($signo) {
case SIGCHLD:
$child -= 1;
}
}
pcntl_signal(SIGCHLD, "sig_handler");
foreach($data as $dataline){
$dataline = trim($dataline);
while($child >= $max){
sleep(1);
}
$child++;
$pid=pcntl_fork();
if($pid){
// SOMETHING WENT WRONG? NEVER HAPPENS!
}else{
exec("php processdata.php \"$dataline\"");
exit;
}//fork
}
while($child != 0){
sleep(1);
}
?>

After the conversation in the comments, here's how to have Node executing your PHP script.
Since you're calling an external command, there's no need to create a new thread. The Node.js runloop understands that calls to external commands are async operations, and it can execute all of them at the same time.
You can see different ways for executing an external process in this SO question (linked answer may be the best in your case).
However, since you're already moving everything to Node, you may even consider rewriting your "process.php" script to Node.js code. Since, as you explained, that script connects to remote servers and databases and uses nslookup (which you may not really need with Node.js), you won't need any separate thread: they're all async operations that Node.js excels at performing.
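As a rough sketch of the "max child count" part of the question: the helper below starts external commands with child_process.spawn and keeps at most maxChildren of them alive, starting the next one whenever a child finishes. The name runWithLimit and its parameters are hypothetical, and output from the children is simply discarded here.

```javascript
const { spawn } = require("child_process");

// Hypothetical helper: runs `command` once per value (the value is appended as
// the last argument), with at most maxChildren processes alive at a time.
// onDone receives the exit codes in the same order as `values`.
function runWithLimit(command, baseArgs, values, maxChildren, onDone) {
  let next = 0;    // index of the next value to start
  let running = 0; // number of children currently alive
  const exitCodes = new Array(values.length);

  function startMore() {
    while (running < maxChildren && next < values.length) {
      const index = next++;
      running++;
      const child = spawn(command, [...baseArgs, values[index]], { stdio: "ignore" });
      let settled = false;
      const settle = (code) => {
        if (settled) return; // 'error' and 'close' may both fire
        settled = true;
        exitCodes[index] = code;
        running--;
        startMore(); // a slot is free: start the next value, if any
      };
      child.on("error", () => settle(null)); // e.g. command not found
      child.on("close", (code) => settle(code));
    }
    if (running === 0 && next >= values.length) onDone(exitCodes);
  }

  startMore();
}
```

For the PHP script in the question, a call might look like runWithLimit("php", ["processdata.php"], lines, 20, callback), where lines holds the trimmed contents of data.txt. No sleep loop is needed, because the close event tells Node when a slot becomes free.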

is it possible to call lua functions defined in other lua scripts in redis?

I have tried to declare a function without the local keyword and then call that function from another script, but it gives me an error when I run the command.
test = function ()
return 'test'
end
-- from some other script
test()
Edit:
I can't believe I still have no answer to this. I'll include more details of my setup.
I am using node with the redis-scripto package to load the scripts into redis. Here is an example.
var Scripto = require('redis-scripto');
var scriptManager = new Scripto(redis);
scriptManager.loadFromDir('./lua_scripts');
var keys = [key1, key2];
var values = [val];
scriptManager.run('run_function', keys, values, function(err, result) {
console.log(err, result)
})
And the lua scripts.
-- ./lua_scripts/dict_2_bulk.lua
-- turns a dictionary table into a bulk reply table
dict2bulk = function (dict)
local result = {}
for k, v in pairs(dict) do
table.insert(result, k)
table.insert(result, v)
end
return result
end
-- run_function.lua
return dict2bulk({ test=1 })
Throws the following error.
[Error: ERR Error running script (call to f_d06f7fd783cc537d535ec59228a18f70fccde663): #enable_strict_lua:14: user_script:1: Script attempted to access unexisting global variable 'dict2bulk' ] undefined

I'm going to be contrary to the accepted answer, because the accepted answer is wrong.
While you can't explicitly define named functions, you can call any script that you can call with EVALSHA. More specifically, all of the Lua scripts that you have explicitly defined via SCRIPT LOAD or implicitly via EVAL are available in the global Lua namespace at f_<sha1 hash> (until/unless you call SCRIPT FLUSH), which you can call any time.
The problem that you run into is that the functions are defined as taking no arguments, and the KEYS and ARGV tables are actually globals. So if you want to be able to communicate between Lua scripts, you either need to mangle your KEYS and ARGV tables, or you need to use the standard Redis keyspace for communication between your functions.
127.0.0.1:6379> script load "return {KEYS[1], ARGV[1]}"
"d006f1a90249474274c76f5be725b8f5804a346b"
127.0.0.1:6379> eval "return f_d006f1a90249474274c76f5be725b8f5804a346b()" 1 "hello" "world"
1) "hello"
2) "world"
127.0.0.1:6379> eval "KEYS[1] = 'blah!'; return f_d006f1a90249474274c76f5be725b8f5804a346b()" 1 "hello" "world"
1) "blah!"
2) "world"
127.0.0.1:6379>
All of this said, this is in complete violation of spec, and it may well stop working in strange ways if you attempt to run it in a Redis cluster scenario.

Important Notice: See Josiah's answer below. My answer turns out to be wrong or at the least incomplete. Which makes me very happy of course, as it makes Redis all the more flexible.
My incorrect/incomplete answer:
I'm quite sure this is not possible. You are not allowed to use global variables (read the docs), and the script itself gets a local and temporary scope from the Redis Lua engine.
Lua functions automatically set a 'writing' flag behind the scenes if they do any write action. This starts a transaction. If you cascade Lua calls, the bookkeeping in Redis would become very cumbersome, especially when the cascade is executed on a Redis slave. That's why EVAL and EVALSHA are intentionally not made available as valid Redis calls inside a Lua script. Same goes for calling an already 'loaded' Lua function which you are trying to do. What would happen if the slave is rebooted between the load of the first script and the exec of the second script?
What we do to overcome this limitation:
Don't use EVAL, only use SCRIPT LOAD and EVALSHA.
Store the SHA1 inside a redis hash set.
We automated this in our versioning system, so a committed Lua script automatically gets its SHA1 checksum stored in the Redis master, in a hash set, with a logical name. The clients can't use EVAL (on a slave; we disabled EVAL+LOAD in config). But the client can ask for the SHA1 for the next step. Almost all our Lua functions return a SHA1 for the next call.
Hope this helps, TW

Because I'm not one to leave well enough alone, I built a package that allows for simple internal calling semantics. The package (for Python) is available on GitHub.
Long story short, it uses ARGV as a call stack, translates KEYS/ARGV references to _KEYS and _ARGV, uses Redis as a name -> hash mapping internally, and translates CALL.<name>(<keys>, <argv>) to a table append + Redis lookup + Lua function call.
The METHOD.txt file describes what goes on, and all of the regular expressions I used to translate the Lua scripts are available in lua_call.py. Feel free to re-use my semantics.
The use of the function registry makes this very unlikely to work in Redis cluster or any other multi-shard setup, but for single-master applications, it should work for the foreseeable future.

Using semaphores while modifying file in perl

I'm kind of a newbie to threads in Perl.
I have a file with a list of projects (one project per line), and I want to build those projects in parallel.
Currently, each thread:
opens the file in "read" mode
saves a list of some projects (= some file lines)
closes the file
opens the file again in "write" mode
rewrites it without the lines that were selected
To make sure each thread is the only one accessing the file, I'm trying to use a semaphore.
For some reason, thread collisions occur, and I can't figure out what I'm doing wrong.
I can see (in my "REPORT", which also records the current time for each build)
that different threads select the same projects from the "shared" file (it happens only once in a while, but still..)
I'm not even sure if my $semaphore declaration is legal as a "my" variable.
Any help would be truly appreciated!!
Thanks.
Here's a part of my code:
my $semaphore = Thread::Semaphore->new() ;
sub build_from_targets_list{
#...
open(REPORT, "+>$REPORT_tmp"); # Open for output
#....
@threads =();
for ($i = 0; $i < $number_of_cores; $i++){
my $thr = threads->new(\&compile_process, $i,*REPORT);
push @threads, $thr;
}
$_->join for @threads;
close (REPORT);
}
### Some stuff..
sub compile_process{
*REPORT = shift(@_);
#...
while (1){
$semaphore->down();
open (DATA_FILE, $shared_file);
flock(DATA_FILE, 2);
while ($record = <DATA_FILE>) {
chomp($record);
push(@temp_target_list,$record);
}
# ... choose some lines (=projects)...
# remove the projects that should be built by this thread:
for ($k = 0; $k < $num_of_targets_in_project; $k++){
shift(@temp_target_list);
}
close(DATA_FILE);
open (REWRITE,">$shared_file");
flock(REWRITE, 2);
seek(REWRITE, 0, 0);
foreach $temp_target (@temp_target_list){
print REWRITE "$temp_target\n";
}
close (REWRITE);
## ... BUILD selected projects...
$semaphore->up();
}
}

First, some basic cleanup of how you're dealing with files. No point in trying to debug a thread problem if it's a simple file issue.
One must check that any file commands (open, close, flock, seek, etc...) succeed. Either add some "or die" checks to them or use autodie.
Second is the use of a hard-coded constant for flock. Those are system dependent, and it's hard to remember which mode 2 is. Fcntl provides the constants.
You're opening the data file for reading with an exclusive lock (2 is usually exclusive lock). That should probably be a shared lock. This would be unlikely to cause a problem, but it will cause your threads to block unnecessarily.
Finally, use lexical filehandles instead of a globally scoped glob. This reduces the chance
use Fcntl qw(:flock);
use autodie;
open (my $data_fh, '<', $shared_file);
flock($data_fh, LOCK_SH);
As a side note, the seek $fh, 0, 0 after opening a file for writing is unnecessary. Same goes for seek constants as for flock, use Fcntl to get the constants.
An additional bug is that you're passing in $i, *REPORT but compile_process thinks *REPORT is the first argument. And again the use of global filehandles means that passing it in is redundant, use lexical filehandles.
Now that's out of the way, your basic algorithm seems flawed. compile_process has each thread reading in the whole data file into the thread local array @temp_target_list, shifting some off of that local array and writing the rest out. Because @temp_target_list is per thread, there's no coordination. Unless $num_of_targets_in_project is shared and doing some sort of off screen coordination, but that's not shown.
File based locking is always going to be a little slice of hell. Threads have much better mechanisms for coordination. There's a much easier way to do this.
Assuming the file isn't too large, read each line into a shared array. Then have each thread take items to work on from that array. The array is shared, so as each element is removed the array will update for all the threads. Something like...
use strict;
use warnings;
use autodie;
use threads;
use threads::shared;

my $Max_Threads = 5;
my @Todo : shared;

# $work_file holds the path to the project list; it must be set beforehand
open my $fh, "<", $work_file;
@Todo = <$fh>;
close $fh;

my @threads;
for (1..$Max_Threads) {
    push @threads, threads->new(\&compile_process);
}
$_->join for @threads;

sub compile_process {
    # each shift removes the item for all threads, since @Todo is shared
    while( my $work = shift @Todo ) {
        # ...do whatever with $work...
    }
}
If the file is too large to be held in memory, you can use Thread::Queue to build a queue of work items and add to it dynamically.
